Yves Normandin - AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo

The new IVR is coming to town. What does that mean?

Yves Normandin — Wed, 05 May 2021 14:07:25 +0000

And in fact, there is nothing really new about it. We usually call it “IVR with speech recognition”, and it is the kind of conversational experience we have been building for 20 years.

What is new is that new technologies and platforms now promise to make these conversational experiences faster and easier to build, while considerably expanding the range of tasks that voice virtual agents (as we call them) can perform.

These new technologies first emerged in voice assistants (Siri, Amazon Echo, Google Home) and are fundamentally changing the way IVR solutions are developed.

To understand how, let's compare “traditional IVR with speech recognition” with this “new IVR”.

Technology	Traditional IVR with speech recognition	New IVR
Speech recognition	Grammars and statistical language models	Speech-to-text (STT)
Natural language understanding (NLU)	Grammars and simple classifiers	Deep-learning-based natural language processing (NLP)
Speech synthesis	Concatenation of recorded voice segments + text-to-speech (TTS)	Mostly text-to-speech (TTS)

Let's look at each of these in more detail.

Traditional IVR with speech recognition

The speech recognition engines traditionally used in IVRs (for example, Nuance Recognizer) do not work “out-of-the-box”. They need speech recognition grammars. There are two main types of grammars:

SRGS grammars are defined by a set of rules written by hand by a grammar developer. They provide a formal description of the utterances the engine can recognize. The language defined by an SRGS grammar is rigid: only utterances covered by the grammar can be recognized. SRGS grammars are well suited to directed dialogs, in which the set of user utterances is typically predictable.
Statistical language models (SLMs) are defined by N-grams, that is, the probabilities of a word occurring given the preceding words in the sentence, learned from a sample of sentences. SLMs provide a much less rigid language model than SRGS grammars and are therefore much better suited to handling responses to open-ended questions (for example, “How may I help you?”), which tend to be more spontaneous and expressed in natural language. To perform well, SLMs must be trained on a corpus of sentences that is large enough and representative of the target domain.
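To make the N-gram idea concrete, here is a minimal sketch of a bigram model (N = 2) estimated by simple counting. The training sentences are invented for illustration, and the sketch applies no smoothing, so unseen word pairs get a probability of zero; production SLMs use larger N, smoothing, and far more data.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Estimate P(word | previous word) by maximum likelihood from sample sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        # <s> and </s> mark the start and end of each sentence.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1

    def probability(prev, word):
        # No smoothing: unseen contexts and unseen pairs both yield 0.0.
        if context_counts[prev] == 0:
            return 0.0
        return bigram_counts[(prev, word)] / context_counts[prev]

    return probability

# Hypothetical sample of caller utterances for a banking prompt.
SAMPLE = [
    "i want to check my balance",
    "i want to pay a bill",
    "i lost my card",
]

prob = train_bigram_model(SAMPLE)
print(prob("i", "want"))       # 2 of the 3 occurrences of "i" are followed by "want"
print(prob("pay", "balance"))  # never observed, so 0.0
```

The zero probability for an unseen pair shows why the training corpus must be large and representative: anything the sample does not cover is effectively ruled out.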

Developing a traditional IVR application with speech recognition requires a separate grammar for each step of the dialog. In addition, to reach sufficient recognition accuracy, these grammars must go through extensive tuning based on real user utterances collected by the IVR application in production.

Developing and tuning these grammars is time-consuming work that requires highly skilled speech specialists. Done well, it can deliver very high recognition accuracy and a positive user experience. Unfortunately, this work is too often neglected, which inevitably results in disappointing recognition performance and, in turn, a poor user experience. This is one of the main reasons why IVRs with speech recognition so often have a bad reputation.

**Speech-to-text (STT)**

In recent years, thanks to deep learning, speech recognition technology has advanced at a breathtaking pace. This breakthrough has made it possible to train STT engines that deliver highly accurate transcriptions over virtually unlimited vocabularies. Many vendors now offer STT engines (for example, Google STT, Nuance Krypton, Amazon Transcribe, Deepgram), and open-source options are also available.

With STT engines, grammars no longer have to be developed, which saves considerable time when building conversational IVR applications. Does this mean the speech recognition problem is solved? Far from it! Achieving acceptable accuracy remains a major challenge. In fact, properly tuned grammars often deliver markedly higher accuracy than the best STT engines.

The main issues currently encountered with STT engines are:

Training data. As with any machine learning model, an STT model performs best when its training data is representative of the conditions in which it is used. For example, a model trained mainly on smart speaker recordings, which typically cover topics such as the weather, setting alarms, playing music, and general knowledge questions, is unlikely to perform optimally in a banking IVR application. Being able to fine-tune an STT model on domain-specific data could make an enormous difference in accuracy. Unfortunately, most STT vendors do not offer this option (Deepgram being an exception). Note, however, that Nuance provides a partial solution by allowing a domain language model (DLM) to be trained on phrases specific to each target domain.

Contextualization. STT engines can conceptually recognize any user utterance, whether it is about movies, politics, the weather, music, or anything else. This is a very powerful capability, but it can become a handicap in conversational applications, which are generally domain-specific and highly contextual. If a virtual agent asks a user for a date of birth, chances are the user will answer with a date of birth. Taking advantage of this contextual knowledge can greatly improve recognition accuracy. Humans do this all the time without even noticing. Some STT engines provide contextualization capabilities (for example, the model adaptation feature of Google's STT engine), but these remain fairly limited for now.

Optimization. Traditional IVR speech recognition engines offer several effective ways to optimize accuracy. For example, significant gains can be obtained by refining phonetic transcriptions, modeling coarticulation within and between words, modeling disfluencies, adjusting the weights of the elements of a grammar or of entire grammars, post-processing the N-best results, and so on. Most STT engines offer few, if any, ways to optimize accuracy.

Multilingual support. Since Nu Echo is based in Montreal, a bilingual city, most of the conversational applications we deploy must handle English words in French sentences and vice versa (address recognition is a very good example). This can only be done effectively with a speech recognition engine capable of handling two languages within a single utterance, a capability available in some traditional IVR speech recognition engines but, to our knowledge, in no STT engine.
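When an STT engine does return N-best hypotheses, part of the contextualization and N-best post-processing described above can be approximated in application code. The sketch below assumes each hypothesis is a `(text, confidence)` pair and that the dialog knows what kind of answer it expects; the state names, patterns, and sample hypotheses are all hypothetical.

```python
import re

# Hypothetical patterns describing what each dialog state expects to hear.
EXPECTED_PATTERNS = {
    "date_of_birth": re.compile(
        r"\b(january|february|march|april|may|june|july|august|"
        r"september|october|november|december)\b.*\b(19|20)\d{2}\b"
    ),
    "yes_no": re.compile(r"\b(yes|no|yeah|nope)\b"),
}

def pick_hypothesis(n_best, dialog_state):
    """Prefer the most confident hypothesis that fits the dialog state.

    n_best: list of (text, confidence) pairs, in any order.
    Falls back to the most confident hypothesis when nothing fits
    or when the state has no expected pattern.
    """
    pattern = EXPECTED_PATTERNS.get(dialog_state)
    ranked = sorted(n_best, key=lambda h: h[1], reverse=True)
    if pattern is not None:
        for text, _confidence in ranked:
            if pattern.search(text.lower()):
                return text
    return ranked[0][0]

# Invented example: the top hypothesis is a misrecognition.
n_best = [
    ("made a tent 1980", 0.62),
    ("may the tenth 1980", 0.58),
]
chosen = pick_hypothesis(n_best, "date_of_birth")
print(chosen)  # may the tenth 1980
```

This only re-ranks what the engine already produced; it cannot recover an answer that appears in none of the hypotheses, which is why engine-level contextualization remains preferable where it is available.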

STT technologies are evolving extremely quickly. We can therefore expect recognition accuracy to keep improving, contextualization and optimization tools to become more effective, and models that can be optimized for specific domains to become more accessible. In the meantime, the ideal solution may well be a combination of STT engines and traditional IVR speech recognition engines.

**Natural language understanding (NLU)**

The first IVR applications with speech recognition relied exclusively on SRGS grammars for speech recognition; natural language understanding (NLU) was therefore not an issue, since it was built into the grammar.

The use of statistical language models (SLMs) created the need for a separate NLU engine capable of understanding recognition results for spontaneous utterances. Intent detection techniques based on simple machine learning were introduced more than 20 years ago for natural language call routing. These techniques work very well, but they usually require a large sample of sentences for each intent to train the model properly, which is often a major obstacle to putting a system into service.

For many years, these techniques did not evolve much. Then came deep learning, which completely changed the natural language processing landscape. A first major change was the introduction of word embeddings, which improve generalization and considerably reduce the number of sample sentences needed to train NLU models. More recently, large language models (trained on very large data corpora, for example BERT) and new neural network architectures have brought further major improvements.

Interestingly, the NLU technologies used for text conversations are the same as those used for voice conversations, even though there are important differences between the two. For example, systems handling text conversations must reliably deal with typos, acronyms and initialisms (for example, “lol”, “mdr”), emoticons, etc., whereas systems handling voice conversations must deal with spelling differences between homophones (for example, in French, “cent” vs. “sans”, “Desjardins” vs. “des jardins”, or “soixante-treize” (73) vs. “soixante treize” (60 13)), unwanted STT normalizations (for example, the postal code “H 1 M 2 L 5” transcribed as “H un mètre deux L cinq”, where “1 M” is read as “one meter”), not to mention speech recognition errors.
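The postal code example shows the kind of post-processing that voice applications often need between STT and NLU. Here is a minimal sketch that undoes this particular normalization. The word table is an assumption about what a French STT engine might emit, and the postal code pattern is simplified (it does not exclude the letters that never appear in real Canadian postal codes).

```python
import re

# Assumed mappings from normalized French words back to spoken symbols.
# Extend this table to match what your STT engine actually produces.
WORD_TO_SYMBOL = {
    "zéro": "0", "un": "1", "deux": "2", "trois": "3", "quatre": "4",
    "cinq": "5", "six": "6", "sept": "7", "huit": "8", "neuf": "9",
    "mètre": "M", "mètres": "M",
}

# Simplified Canadian postal code shape: letter digit letter digit letter digit.
POSTAL_CODE = re.compile(r"^[A-Z]\d[A-Z]\d[A-Z]\d$")

def recover_postal_code(transcript):
    """Return 'A1A 1A1' if the transcript can be read as a postal code, else None."""
    symbols = []
    for token in transcript.split():
        # Known words become a digit or letter; other tokens are kept, uppercased.
        symbols.append(WORD_TO_SYMBOL.get(token.lower(), token.upper()))
    candidate = "".join(symbols)
    if POSTAL_CODE.match(candidate):
        return candidate[:3] + " " + candidate[3:]
    return None

print(recover_postal_code("H un mètre deux L cinq"))  # H1M 2L5
```

A repair like this is only safe when the dialog expects a postal code, since words such as “un” or “mètre” are perfectly valid in other contexts.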

Let's now look at some issues related to the use of NLU engines:

Contextualization. Most NLU engines are not contextual (Dialogflow being an exception), which can be a problem because the same utterance can have different interpretations depending on the context in which it occurs. For example, the utterance “Montreal” will be interpreted differently depending on whether the question asked was “What is your destination?” or “What is your departure city?”

Confidence scores. An effective repair dialog must rely on reliable confidence scores but, unfortunately, NLU confidence scores tend not to be very accurate. Moreover, NLU scores generally do not take the speech recognition confidence score into account. Yet how can an NLU result be trusted if it is based on a speech recognition result with low confidence? To be dependable, confidence scores in voice conversational applications must take both STT and NLU scores into account.

N-best results. Many NLU engines return only a single intent, the one with the highest confidence score, even when other intents have almost identical scores. Having access to the list of N-best results makes it possible to make better dialog decisions (for example, when certain utterances need to be disambiguated) or to choose the best hypothesis based on contextual information that is not available to the NLU engine.
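The last two points can be illustrated with a small decision function. This is a sketch under stated assumptions, not a recommended formula: multiplying the STT and NLU scores is only one simple way to combine them, and the thresholds, margin, action names, and intent names are all hypothetical values that would need tuning on real data.

```python
def next_dialog_action(stt_confidence, intents,
                       accept_threshold=0.7, reject_threshold=0.3,
                       ambiguity_margin=0.1):
    """Choose a dialog action from STT confidence and N-best NLU intents.

    intents: list of (intent_name, nlu_confidence) pairs, in any order.
    Returns (action, payload) where action is one of
    'accept', 'confirm', 'disambiguate', 'reprompt'.
    """
    if not intents:
        return ("reprompt", None)
    ranked = sorted(intents, key=lambda item: item[1], reverse=True)
    top_name, top_score = ranked[0]

    # Assumed combination rule: a weak STT result drags down a strong NLU score.
    combined = stt_confidence * top_score
    if combined < reject_threshold:
        return ("reprompt", None)

    # Two intents with nearly identical scores: ask the caller to choose.
    if len(ranked) > 1 and top_score - ranked[1][1] < ambiguity_margin:
        return ("disambiguate", [top_name, ranked[1][0]])

    if combined < accept_threshold:
        return ("confirm", top_name)
    return ("accept", top_name)

print(next_dialog_action(0.9, [("pay_bill", 0.9), ("check_balance", 0.2)]))
print(next_dialog_action(0.9, [("pay_bill", 0.55), ("check_balance", 0.5)]))
print(next_dialog_action(0.4, [("pay_bill", 0.9), ("check_balance", 0.1)]))
```

The three calls yield an acceptance, a disambiguation, and a confirmation respectively. None of these decisions is possible if the platform exposes only the single best intent without its score.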

Natural language processing is currently one of the most active research areas in artificial intelligence, and we expect conversational AI systems to benefit from a steady stream of technological advances.

Speech synthesis

Text-to-speech (TTS) technologies have been around for a very long time but, until very recently, the quality and intelligibility of the output were not good enough to deliver an acceptable conversational experience. In the best IVR applications with speech recognition, nearly all voice segments were recorded in a studio by professional voice talent. Rendering sentences containing dynamic data then required concatenating segments, which is quite complex to do properly.

But TTS technologies have recently made phenomenal progress. In most cases it is now reasonable to use synthetic voices instead of studio recordings. This is especially true in English, where the best synthetic voices are so good that they are sometimes hard to tell apart from a human voice. It is also now possible to create custom synthetic voices that imitate our favorite professional voice talent.

Using TTS is a real game changer for the creation and development of conversational IVR applications. First, it eliminates the need to go back to the studio to record new voice segments every time the application changes. Second, it spares us the tedious, error-prone handling of thousands of voice segments (often in more than one language). Applications can now be modified, tested, and put into production in quick succession.

Of course, TTS technologies are not perfect and we still encounter occasional errors, but this is generally a small price to pay compared with the tremendous value they bring. The ideal solution may well be a combination of studio recordings, for key audio segments where a specific intonation and emotion are wanted, and custom synthesized segments built from the same professional voice as the one used in the prerecorded segments.

Integration with contact center platforms

Traditional IVR applications with speech recognition have long relied on proven standards for integrating conversational technologies: the MRCP protocol for speech recognition and synthesis, VoiceXML for dialogs, the SRGS specification for speech recognition grammars, and SISR for semantic interpretation.

Now, with the emergence of a new generation of cloud contact center platforms and the arrival of the latest deep-learning-based technologies, all these standards are becoming obsolete and are being replaced by an array of proprietary application programming interfaces (APIs) and emerging standards (for example, gRPC).

Integrating these new conversational technologies with contact center platforms remains a work in progress. Here is what we observe:

Some basic capabilities that we used to take for granted (for example, barge-in and DTMF fallback) are not always available.
The choice of conversational technologies available on many contact center platforms remains limited.
Even when integrations are available, it is often very difficult to take full advantage of the new technologies (for example, confidence scores or N-best lists may not be accessible, or it may be impossible to post-process STT results before sending them to the NLU engine).

Some solutions are slowly emerging to address these integration problems. For example, AudioCodes, with its VoiceAI Connect, claims to provide “easy connectivity between any CC platform and any bot frameworks or speech engine”. This could make it possible to make the most of conversational technologies according to the requirements of each solution being deployed.

The best of both worlds

Deep learning is having a fundamental impact on conversational AI technologies, and this is considerably changing the way we think about developing IVR applications. We are still in the early days of this transformation. These new technologies are still immature but will probably evolve very quickly in the near future. It is up to us to adapt to their rapid evolution and to understand how to use them most effectively.

Nevertheless, these new technologies already offer very significant concrete benefits. For example:

To get speech recognition working, it is no longer necessary to create complex grammars or collect thousands of training utterances for SLMs. The out-of-the-box recognition accuracy of the best STT engines is good enough that a working system can now be put into production quickly.
The latest NLU engines can be trained with far fewer sentences than older NLU classification technologies, which again makes it possible to put a first version of a system into production very quickly.
The latest speech synthesis technologies have become so good that prerecorded audio segments are now almost never necessary (particularly in English). This considerably reduces the time needed to design new versions of an application and put them into production, making deployment much easier and faster.

Being able to quickly put a first version of an application into service is crucial, because it makes it possible to start collecting real conversational data and real user utterances early. This is the raw material with which the system can be continuously improved and optimized.

While some of the limitations of STT technologies (for example, in terms of contextualization, optimization, and multilingual handling) are starting to be addressed, developers of conversational IVR applications should consider combining STT technologies with traditional IVR speech recognition technologies to get the best of both worlds and deliver remarkable experiences to conversational IVR users. Some IVR platforms, for example the Genesys voice platform, allow this combination of approaches.

The post La nouvelle RVI arrive en ville. Qu’est-ce que ça signifie? first appeared on AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo.


A look back at VUX World's live interview on Google CCAI with Antony Passemard

Yves Normandin — Fri, 26 Mar 2021 18:05:19 +0000

Dialogflow CX vs. ES

The interview began with a comparison between Dialogflow CX and ES. CX is not just an evolution of ES; it is in fact a complete redesign, with a more intuitive and much more powerful dialog model. Among other things, CX provides a clear separation between intents and dialogs, which considerably improves intent reusability and dialog management. CX comes with a visual interface that “Conversational Architects” can easily use to create complex dialogs with fewer lines of code.

According to Passemard, this was a long-standing request from many customers. Although Google will continue to support and evolve Dialogflow ES, which remains appropriate for simple dialogs, Dialogflow CX should become the platform of choice for long and complex dialogs. In addition, Dialogflow CX offers several advantages over ES:

More predictable pricing (but not necessarily cheaper)
Several IVR-specific features (including support for barge-in, DTMF, timeouts, and retries)
Support for up to 40,000 intents (compared with 2,000 in ES)
More collaboration features, which let development teams work more efficiently on large projects
Better support for analytics, experimentation, and feedback loops
A more powerful NLU engine, based on the latest BERT model

Anyone can use Dialogflow today. For conversational IVR, however, integrating Dialogflow with a contact center platform generally remains a challenge. For example, the IVR platform must be able to provide Dialogflow with certain events or parameters, whether to use DTMF for menu choices or to handle successive no-input events incrementally.
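Incremental no-input handling amounts to keeping a small counter on the IVR side and sending a different event each time the caller stays silent. The sketch below is illustrative only: the class and the event names (`no-input-1`, `no-input-max`) are hypothetical and are not taken from the Dialogflow API, so they would have to be mapped to the events that the platform and the agent actually define.

```python
class NoInputTracker:
    """Counts consecutive silent turns and picks an escalating event name."""

    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self.consecutive = 0

    def on_caller_input(self):
        # Any speech or DTMF input resets the escalation.
        self.consecutive = 0

    def on_no_input(self):
        # Each silent turn escalates; the last one signals that the
        # dialog should give up (for example, transfer to an agent).
        self.consecutive += 1
        if self.consecutive >= self.max_attempts:
            return "no-input-max"
        return f"no-input-{self.consecutive}"

tracker = NoInputTracker()
events = [tracker.on_no_input() for _ in range(3)]
print(events)  # ['no-input-1', 'no-input-2', 'no-input-max']
```

Each event can then trigger a progressively more helpful prompt in the agent, which is only possible if the contact center platform lets the application send such events.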

Passemard mentioned that some solutions, such as AudioCodes, can make this integration easier. Interestingly, he also mentioned that it is preferable to stream the audio directly to Dialogflow rather than use Google STT to transcribe the audio and then send the transcription to Dialogflow. The reason is that Dialogflow has an automatic speech adaptation feature that optimizes transcription accuracy based on the agent's training phrases.

That said, our own experience shows that we can often get results that are as good, if not better, by sending the audio directly to Google STT and using speech adaptation. In addition, transcription results often need to be post-processed to make them compatible with Dialogflow's NLU, which is not possible when the audio is streamed directly to Dialogflow.

Agent Assist for the voice channel

The next topic covered in the interview was Agent Assist. This is an important topic for at least two reasons. First, there are very promising use cases for Agent Assist. Second, we have heard a lot about CCAI Agent Assist over the past two years, but it has been difficult to understand exactly how to access it.

On this last point, Passemard confirmed what we suspected: there is no public API for voice Agent Assist, as Google has decided to make it available only through CCAI telephony partners. As Simms points out, this is probably a good business strategy for Google. By working aggressively with telephony partners to integrate Agent Assist into their platforms, and by selling this capability exclusively through those partners, Google could become the de facto choice for Agent Assist.

The downside, however, is that enterprises depend entirely on the motivation and ability of contact center platform vendors to make CCAI available to their customers. It could therefore take a long time before many enterprises can take advantage of CCAI and, when they can, it may require very costly upgrades to their contact center infrastructure. For this reason, customers may end up falling back on the alternative solutions that will inevitably become available.

This brings me to Agent Assist use cases. Passemard mentioned that customers have not found the feature that suggests relevant documents to agents based on the conversation very useful. Agents do not want to read entire documents to find the answer to a customer's question. They want extractive search, which automatically extracts the relevant part of the document. From what we understood, this capability is coming soon. But what is really taking off right now, according to Passemard, is the ability to fill in forms automatically in real time with the information mentioned by the caller. And, of course, another benefit of Agent Assist is getting a transcript of every call.

Agent Assist for chat

Passemard indicated that Agent Assist for chat considerably improves agent productivity and satisfaction as well as CSAT scores. In particular, the Smart Reply and Smart Compose features are powered by predictive models trained on the customer's data, which makes them much more accurate. Agent Assist for chat is currently available only through chat vendors, but a public API will be available soon.

Insights

The last CCAI feature mentioned is Insights, which is Google's name for conversation analytics (“speech analytics”). Insights is still in preview, but the good news is that it will soon be accessible to everyone through a public API. Insights is about understanding the conversations taking place in the contact center. With Insights, enterprises will be able to review conversations, index them, search them, perform topic modeling and sentiment analysis, navigate within a conversation, and run natural language searches (for example, “give me all conversations with a greeting”). Google will offer a SIPREC integration.

Final notes

Passemard mentioned that conversational AI is probably the first application of AI to have a profound impact on consumers. It is an intriguing claim; it would be interesting to see some data supporting it. He also concluded by stressing the importance of a good conversational architect. We could not agree more. This is certainly not an expertise that can be learned in two weeks. The very good ones have years of experience and are essential to the success of any conversational project.

The post Retour sur l’entrevue live sur Google CCAI de VUX World avec Antony Passemard first appeared on AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo.


A look back at the Nuance Partner eXperience Summit: An accelerated transformation in a fluid market (article in English)

Yves Normandin — Tue, 03 Mar 2020 15:18:04 +0000

Since Mark Benjamin joined Nuance as its new CEO almost two years ago, the company has been going through a breathtaking transformation. After selling its imaging division to Kofax and spinning off its automotive division, the company now focuses primarily on its core business of providing conversational AI products and solutions.

Even in its core conversational AI business, Nuance is fast transforming itself from a software company with a very large focus on professional services to a product and platform company. This was evident at the 2019 Partner eXperience Summit, but it was even more so at this year’s event.

This change was of course necessary and, I might add, a bit overdue. Long the dominant vendor of enterprise speech technology solutions, Nuance is now being challenged by companies – Google, Amazon, Microsoft, and IBM among others – that offer easy-to-use conversational AI platforms with state-of-the-art technologies. With these platforms, the claim is that anybody can now develop sophisticated conversational AI solutions; that speech recognition (ASR) and natural language understanding (NLU) work “out-of-the-box” without any need for speech scientists; and that, in fact, you don’t even need developers to build solutions. This is the “do-it-yourself” (DIY) message and it is a compelling one.

Of course, that message is highly misleading. Yes, to some extent, the technology now works “out-of-the-box” in the sense that it is possible to get a simple conversational demo bot up and running quickly. With speech-to-text (STT) engines, there is no need to write speech recognition grammars, and NLU engines can be trained with a few training phrases per intent. But that’s only good for a demo. Building an effective, enterprise-grade conversational AI system is hard work, no matter what the platform is (more on that in a future blog post).

What is true, though, is that enterprises really are looking for DIY tools. And they are increasingly demanding cloud-native solutions. And, above all, they want flexibility. And Nuance has heard that message loud and clear. They now understand that it’s no longer sufficient to have best-in-class technology and a good professional services organization. Customers want to have flexible development and deployment models.

The most recent big steps that Nuance has taken in that direction are:

Conversational AI APIs (launched November 2019)
The Nuance GateKeeper cloud-based security and biometrics suite (launched October 2019)
Nuance Mix: DIY tooling for partners and end users (general availability planned for the end of March)

The introduction of Nuance Mix, in particular, is a big change for a company that is used to directly delivering most of its conversational AI solutions through its professional services organization, using closely guarded development tools. But what we’ve seen so far of Mix is promising, with a slick, contemporary user interface. From a company that has years of experience building and deploying compelling conversational AI solutions, this is quite encouraging.

Nuance is facing powerful new competitors, but it has many advantages. Its technology is top-notch, it has a very large installed base, it offers the most flexible deployment models (on-premises or cloud), its technology is integrated with most contact center platforms, and it understands better than anybody what it takes to deliver conversational experiences that work not just in demos, but in the real world. Nuance also offers the most extensive capabilities to adapt and optimize the technology for a specific domain and a specific dialog state, which is often what makes the difference between a good demo and an enterprise-grade solution.

Another Nuance differentiator – which the company positions as a key element of its value proposition – is its strong professional services organization. But that could also turn out to be its Achilles’ heel, because customers no longer want to be dependent on the vendor’s PS; they want to know that there is a large pool of people who are skilled in the technology and have all the necessary tools. It will be a challenge to change a company that is culturally used to delivering all the big projects into one that enables its partners and customers to do it themselves.

In conclusion, Nuance is clearly going in the right direction and making all the right moves, but its plan is ambitious, so execution will be key. Perhaps the biggest challenge will be to implement the culture changes that are required in order to successfully implement this transformation.

We’ve been in this market for close to 20 years and these are by far the most interesting times we’ve seen. We’re expecting quite a ride in the next few years.

The post Retour sur le Nuance Partner eXperience Summit: Une transformation accélérée dans un marché fluide (Article en Anglais) first appeared on AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo.

FAQ: experiments with Dialogflow knowledge connectors

Yves Normandin — Wed, 15 Jan 2020 18:00:54 +0000

Chatbots come in multiple forms and can serve many different purposes. Without claiming to be exhaustive, we can mention:

task-oriented bots, which aim to assist a user with a given set of transactional tasks such as banking operations;
chit-chat bots, whose primary objective is to mimic casual conversation;
question answering bots, whose purpose is to, you guessed it, answer users’ questions.

These categories are not mutually exclusive: A task-oriented bot can support some level of small talk, and question answering bots can assist the user in some tasks. These are to be perceived as paradigms more than strict definitions.

In this article, we will focus on the concept of the question answering chatbot, and more specifically on the implementation of this concept in Dialogflow, using Knowledge connectors (still a beta feature at the moment of writing).

About Dialogflow FAQ knowledge connectors

Knowledge connectors are meant to complement the intents of an agent and offer a quick and easy way to integrate existing knowledge bases into a chatbot. Dialogflow offers two types of knowledge connectors: FAQ and Knowledge Base Articles. Here we will mostly focus on the FAQ knowledge connector, which models the knowledge base as a list of question-answer pairs (QA pairs).

Each QA pair in a FAQ knowledge connector can be seen as a special kind of intent that has a single training phrase and a single text response. At first sight, the main advantages of a FAQ knowledge connector over defined intents seem to be the ease of integrating external knowledge bases and the fact that, contrary to defined intents, more than a single response can be returned (which can be convenient for a search mode).

Are there any other advantages? One of our hypotheses when we started this work was that knowledge connectors would be able to leverage the answer in the QA pair when matching the query, not just the question. This is not explicitly mentioned in the documentation, but it would make sense for two reasons. First, it’s hard to believe that any NLU engine can effectively learn from a single training phrase. There are always many ways to ask a question that don’t look at all like the training phrase. Second, FAQ data sources often have long answers that could conceivably be correct answers to a wide range of questions other than the one provided. When trying to find the correct answer to a user query, it would therefore make sense for the engine to focus as much on finding the answer that best answers the query as on finding the question that best matches the query.

Anatomy of the Knowledge Base

The knowledge base we used was taken from the Frequently Asked Questions (FAQ) section of the website of a North American airport. It contains more than a hundred QA pairs, divided into a dozen categories. Each category contains between one and about ten subcategories.

While some questions have straightforward answers, others have complex, multi-paragraph ones. All the answers are primarily composed of text, but many also contain tables or images, and some even contain videos. Many answers also have hyperlinks leading to other parts of the FAQ or to external pages.

Minor surgery on the Knowledge Base

While analyzing the knowledge base, we found that several questions only made sense within the context of the category and sub-category in which they appear. For instance, in the Parking section, we have the question “How do I reserve online?”. The FAQ context makes it clear that this is a question about parking reservation, but this information is lost when modeling the knowledge base as a CSV-formatted list of question-answer pairs (QA pairs). We therefore had to modify several of the original questions so that they could be understood without the help of any context. So, in the example above, the question was changed to: “How do I reserve a parking space online?”.
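
As a rough sketch of that rewriting step (not the code used in the experiments), each QA pair can carry an optional context-free rewrite of its question, and the rewrite is what gets exported. The `QAPair` fields, the placeholder answer text, and the headerless two-column question,answer layout are all assumptions made for illustration.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional


@dataclass
class QAPair:
    category: str                      # FAQ section the pair came from (hypothetical field)
    question: str                      # question as worded on the website
    answer: str
    standalone: Optional[str] = None   # manual rewrite that needs no FAQ context


def to_csv(pairs):
    """Serialize QA pairs as question,answer rows, preferring the standalone rewrite."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for pair in pairs:
        writer.writerow([pair.standalone or pair.question, pair.answer])
    return buffer.getvalue()


pairs = [
    QAPair(
        category="Parking",
        question="How do I reserve online?",
        answer="Placeholder answer text.",   # made-up answer, for illustration only
        standalone="How do I reserve a parking space online?",
    ),
]
print(to_csv(pairs))
```

Keeping the original wording next to the rewrite makes it easier to redo the export when the website FAQ changes.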

What questions users ask

The airport website offers users two distinct ways to type queries to get answers: one that clearly looks like a search bar and another one that looks like a chat widget that pops up when clicking a “Support” button at the bottom right of the web page. Both of them do the exact same thing: They perform a search in the knowledge base and return links to the most relevant articles. However, we believe that the chat-like interface invites more complex, natural queries since users may believe they are entering a chat conversation.

The airport provided us with a set of real user queries collected from the two query interfaces. This is very important because this tells us what questions users are really asking and it provided us with real user data for our experiments.

Of course, we had to do some cleaning on that data set, as a good number of queries were not relevant for our purpose: digit strings (most likely phone numbers and extensions), flight numbers with no other indications, or purely phatic sentences (for example, “how are you?”). We also observed that the queries could be separated into two groups: either they were really short and to the point, with one or two words at most, or they were long and complex, with lots of information and details, and usually formulated as a question.
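
A first pass of this kind of cleaning can be automated with a few simple rules before a manual review. The sketch below is only an illustration of the idea: the regular expressions, the list of phatic phrases, and the `is_usable` name are assumptions, not the filters we actually applied.

```python
import re

# Digits plus typical phone punctuation, e.g. "514-555-0199 x 204" (illustrative pattern)
DIGITS_ONLY = re.compile(r"^(?=.*\d)[\d\s\-().+#x]+$", re.IGNORECASE)
# A bare flight number such as "UA789" or "AC 123", with nothing else around it
FLIGHT_ONLY = re.compile(r"^[A-Za-z]{2}\s?\d{1,4}$")
# Purely phatic utterances (a short, made-up list)
PHATIC = {"hello", "hi", "how are you", "thanks", "thank you"}


def is_usable(query: str) -> bool:
    """Return False for queries that carry no question we could answer."""
    text = query.strip()
    if not text:
        return False
    if DIGITS_ONLY.match(text) or FLIGHT_ONLY.match(text):
        return False
    normalized = re.sub(r"[^\w\s]", "", text).lower().strip()
    return normalized not in PHATIC


raw = ["How are you?", "UA789", "514-555-0199 x 204", "parking", "Can I bring a stroller?"]
print([q for q in raw if is_usable(q)])
```

Rules like these only remove the obvious noise; out-of-domain or ambiguous queries still have to be judged by a person.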

Augmenting the corpus

Once the data set was cleaned, we ended up with about 300 queries (down from a little more than 1500!). Clearly, this would not be sufficient for our experiments, so we decided to collect additional data that, we hoped, would still be representative of real user queries.

We considered using crowdsourcing solutions (like Amazon Mechanical Turk) but ultimately decided to try other options. Instead, we used the People also ask and Related searches functionalities of Google Search to glean additional user data. We would start with a user query (real or fabricated) and collect the related questions proposed by Google. One interesting feature of the People also ask functionality is that every time we expand one of the choices, it proposes several additional related questions. This way, we ended up collecting about 300 additional queries with little to no effort, effectively doubling the number of queries we had.

At the same time, we also organized an internal data collection at Nu Echo, where our colleagues were asked to write plausible user queries based on general categories that we assigned to them. This gave us over 400 additional queries, bringing our total to about a thousand.

Annotating the corpus

Annotating the corpus consists in manually determining which QA pair in the knowledge base, if any, correctly answers each of the queries in the corpus. While this sounds simple, it proved to be a surprisingly difficult task. Indeed, the human annotator has to carefully analyze each potential answer before deciding whether or not it’s a correct response to the query. For some queries, there was no correct answer, but there were one or several QA pairs that provided relevant answers.

We ended up separating the corpus into three categories:

Queries with a correct answer (an exact match);
Queries without an exact match but with one or several relevant answers (relevant matches);
Queries without any match at all.

Queries in the second category were labeled with all relevant QA pairs. When we finished annotating, only 34% of the queries had an exact match, even though 91% of the corpus can be considered “in-domain”. An interesting observation is that the FAQ coverage varied significantly based on the source of the queries, as shown in the table below.

Source	Count	Exact match	Coverage
Google	275	133	48.36%
Website queries	303	63	20.79%
Nu Echo	440	150	34.09%
Total	1018	346	33.99%
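
For reference, the coverage column is simply the number of exact matches divided by the number of queries. The short check below recomputes it from the counts in the table; the variable and function names are ours.

```python
# (number of queries, number with an exact match), taken from the table above
counts = {
    "Google": (275, 133),
    "Website queries": (303, 63),
    "Nu Echo": (440, 150),
}


def coverage(total, exact):
    """Percentage of queries that have an exact match in the FAQ."""
    return 100.0 * exact / total


rows = {source: round(coverage(t, e), 2) for source, (t, e) in counts.items()}
total_queries = sum(t for t, _ in counts.values())
total_exact = sum(e for _, e in counts.values())
overall = round(coverage(total_queries, total_exact), 2)

print(rows)
print(total_queries, total_exact, overall)
```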

Our explanation is that the Google queries tended to be simpler and more representative of real user queries, whereas the website queries were often out-of-domain, incomplete, or ambiguous. The Nu Echo queries tended to be overly “creative” and generally less realistic.

Train and test set

We split our corpus into a train set and a test set. The queries in the train set are used to improve accuracy while the test set is used to measure accuracy. Note that this is a very small test set. It contains 407 queries, of which only 151 have an exact match (37%). It is also very skewed: The top 10% most frequent FAQ pairs account for 61% of those 151 queries.
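
A skew figure like this one can be computed from the annotated labels along the following lines. This is a sketch: `top_share` is a hypothetical helper, the labels are made up, and it assumes that “top 10%” is taken over the QA pairs that actually occur in the set.

```python
import math
from collections import Counter


def top_share(labels, fraction=0.1):
    """Share of labeled queries covered by the most frequent `fraction` of QA pairs."""
    counts = Counter(labels)
    k = max(1, math.ceil(len(counts) * fraction))   # number of QA pairs in the "top"
    covered = sum(n for _, n in counts.most_common(k))
    return covered / len(labels)


# Made-up labels: 10 distinct QA pairs, one of which is much more frequent
labels = ["a"] * 6 + list("bcdefghij")
print(top_share(labels))
```

The more skewed the set, the more the measured accuracy is driven by a handful of QA pairs, which is worth keeping in mind when reading the results.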

Performance metrics

To measure performance, we need to decide which performance metrics to use. We opted for precision and recall as our main metrics. They are defined as follows:

Precision: of all the predictions returned by Dialogflow, how many of them are actually correct?
Recall: of all the actual responses we’d like to get, how many of them were actually predicted by Dialogflow?

In our case, we considered only exact matches and the top prediction returned by Dialogflow. One reason for this is that relevant matches are fairly subjective and we have found that the agreement between annotators tends to be low. Another reason is that this makes comparison with other techniques (e.g., using defined intents) easier since these techniques may only return one prediction.

Since Dialogflow returns a confidence score that ranges from 0 to 1 for each prediction it makes, we can control the precision-recall tradeoff by changing the confidence threshold. For example:

when the threshold is at 0, we accept all predictions, and the recall is at its highest, while the precision is usually at its lowest;
when the threshold is at 1, we exclude almost all predictions, so the recall will be at its lowest, but the precision usually is the highest.
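
To make the definitions concrete, here is a minimal sketch of how precision and recall can be computed at a given confidence threshold, counting only exact matches and the top prediction. The `Result` structure and the sample data are invented for illustration, and returning a precision of 1.0 when no prediction is accepted is a convention we chose for the sketch.

```python
from typing import List, NamedTuple, Optional, Tuple


class Result(NamedTuple):
    gold: Optional[str]        # annotated exact-match QA pair, None if there is none
    predicted: Optional[str]   # top QA pair returned by the agent, None if nothing came back
    confidence: float          # confidence score in [0, 1]


def precision_recall(results: List[Result], threshold: float) -> Tuple[float, float]:
    accepted = [r for r in results if r.predicted is not None and r.confidence >= threshold]
    correct = sum(1 for r in accepted if r.predicted == r.gold)
    answerable = sum(1 for r in results if r.gold is not None)
    precision = correct / len(accepted) if accepted else 1.0   # convention when nothing is accepted
    recall = correct / answerable if answerable else 0.0
    return precision, recall


results = [
    Result("parking", "parking", 0.9),
    Result("security", "baggage", 0.6),
    Result("security", "security", 0.7),
    Result(None, "parking", 0.3),       # query with no exact match in the FAQ
]
for threshold in (0.0, 0.65, 0.8):
    print(threshold, precision_recall(results, threshold))
```

Sweeping the threshold from 0 to 1 and plotting the resulting (recall, precision) points gives the kind of curve discussed in the rest of this article.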

When shown graphically, this provides a very useful visualization that makes it easy to quickly evaluate the performance of an agent against a given set of queries, or to compare agents (see results below).

We’re now ready to delve into some of the experiments we performed. Note that the data used in these experiments is publicly available in a Nu Echo GitHub repository.

Experiments with the FAQ Knowledge connector

We took all of the QA pairs we extracted from the airport knowledge base and pushed those to a Dialogflow Knowledge Base FAQ connector. Then we trained an agent and tested this agent with the queries in the test set. Here’s the result.

Ouch! This curve shows, at best, a recall of barely 40%. And that’s with less than 30% precision. Something is definitely wrong here. A first analysis of the results reveals something very interesting: The question in the QA pair that correctly answers the user query is often very different from the query. For instance, the correct answer to the query “Can I bring milk with me on the plane for the baby?” is actually found in the QA pair with the following question: “What are the procedures at the security checkpoint when traveling with children?”. In other words, those two formulations are too far apart for any NLU engine to make the connection. In order to identify the correct QA pair, one really has to analyze the answer in order to determine whether it answers the query.

Unfortunately, Dialogflow seems to mostly rely on the question in the QA pair when predicting the best QA pairs and that creates an issue: The more information there is in a FAQ answer, the more difficult it is to reduce it to a single question.

What if QA pairs could have multiple questions?

Contrary to defined intents, Dialogflow FAQ knowledge connectors are limited to a single question per QA pair. While this makes sense if the goal is to use existing FAQ knowledge bases “as is”, it may limit the achievable question answering performance. But what if we work around that restriction by including multiple copies of the same QA pair, but using different question formulations (different questions, same answer)? This could allow us to capture different formulations of the same question, as well as entirely different questions for which the answer is correct.

Here is how we did it:

We selected the top 10 most frequent QA pairs in the corpus. For each of them, we created several new QA pairs containing the same answer, but a different question (using questions from the train set). We called this the expanded FAQ set.
We created a new agent trained with this expanded set of QA pairs.
We tested this new agent on the test set.
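
The expansion step can be sketched as follows. This is an illustration under assumptions, not our actual script: the `expand_faq` name, the dictionary layout, and the sample queries and answers are all made up.

```python
from collections import Counter


def expand_faq(faq, train, top_n=10):
    """Add extra (question, answer) rows for the most frequent QA pairs.

    faq:   {qa_id: (question, answer)}
    train: list of (query, qa_id or None) from the annotated train set
    """
    frequency = Counter(qa_id for _, qa_id in train if qa_id in faq)
    top_ids = {qa_id for qa_id, _ in frequency.most_common(top_n)}
    expanded = list(faq.values())                    # start from the original QA pairs
    seen = {question.lower() for question, _ in expanded}
    for query, qa_id in train:
        if qa_id in top_ids and query.lower() not in seen:
            expanded.append((query, faq[qa_id][1]))  # same answer, new question wording
            seen.add(query.lower())
    return expanded


faq = {
    "children": ("What are the procedures at the security checkpoint when traveling with children?", "Answer A"),
    "parking": ("How do I reserve a parking space online?", "Answer B"),
}
train = [
    ("Can I take baby formula through security?", "children"),
    ("Are strollers allowed through security?", "children"),
    ("book parking", "parking"),
    ("where is the chapel", None),
]
expanded = expand_faq(faq, train, top_n=1)
print(len(expanded))
```

Only train set queries are used as additional questions, so the test set remains untouched for the evaluation.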

The graph below compares the performance of this new agent with the original one. There is a definite improvement in recall, but precision still remains very low.

FAQ vs Intents

How do defined intents compare with Knowledge Base FAQ? To find out, we created an agent with one intent per FAQ pair. For each intent, the set of training phrases included the original question in the QA pair, plus all the queries in the train set labelled with that QA pair as an exact match. Then we tested this new agent on the test set. The graph below compares this new result with the previous two results.
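
Building the intents from the annotated data amounts to grouping the train queries by QA pair, as in the sketch below. The `build_intents` name and the dictionary layout are hypothetical; pushing the result to an agent through the Dialogflow console or API is not shown.

```python
def build_intents(faq, train):
    """One intent per QA pair: original question plus exact-match train queries.

    faq:   {qa_id: (question, answer)}
    train: list of (query, qa_id or None) from the annotated train set
    """
    intents = {
        qa_id: {"training_phrases": [question], "response": answer}
        for qa_id, (question, answer) in faq.items()
    }
    for query, qa_id in train:
        if qa_id in intents and query not in intents[qa_id]["training_phrases"]:
            intents[qa_id]["training_phrases"].append(query)
    return intents


faq = {"parking": ("How do I reserve a parking space online?", "Answer B")}
train = [("book parking", "parking"), ("book parking", "parking"), ("where is the chapel", None)]
intents = build_intents(faq, train)
print(intents["parking"]["training_phrases"])
```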

That is an amazing jump in performance. Granted, these are not great results, but at least we know we are heading in the right direction and that performance could still be improved a lot.

A quick look at Knowledge Base Articles

As mentioned before, Dialogflow offers two types of knowledge connectors: FAQ and Knowledge Base Articles. Knowledge Base Articles are based on the technologies used by Google Search, which look for answers to questions by reading and understanding entire documents and extracting a portion of a document that contains the answer to the question. This is often referred to as open-domain question answering.

We wanted to see how this would perform on our FAQ knowledge base. To get the best possible results, we reviewed and edited the FAQ answers to make sure we followed the best practices recommended by Google. This includes avoiding single-sentence paragraphs, converting tables and lists into well-formed sentences, and removing extraneous content. We also made sure that each answer was completely self-contained and could be understood without knowing its FAQ category and sub-category. Finally, whenever necessary, we added text to make it clear what question was being answered. The edited FAQ answers are provided in the Nu Echo GitHub repository.

The result is shown below (green curve, bottom left). What this shows is that the Knowledge Base Articles connector just doesn’t work for that particular knowledge base. The question is: why?

Although further investigation is required, a quick analysis immediately revealed one issue: Some frequent QA pairs don’t actually contain the answer to the user query, but instead provide a link to a document containing the desired information. This may explain why, in those cases, the Article Knowledge Connector couldn’t match the answer to the query.

Conclusion

We wanted to see whether it was possible to achieve good question answering performance by relying solely on Dialogflow Knowledge Connector with existing FAQ knowledge bases. The answer is most likely “no”. Why? There are a number of reasons:

While defined intents can have as many training phrases as we want, FAQ knowledge bases are limited to a single question per QA pair. This turns out to be a significant problem since it is difficult to effectively generalize from a single example. That’s especially true for QA pairs with long answers, which can correctly answer a wide range of very different questions.
FAQ knowledge bases are often not representative of real user queries and, therefore, their coverage tends to be low. Moreover, they often need a lot of manual cleanup, which means that we cannot assume that the system will be able to automatically take advantage of an updated FAQ knowledge base.
Many user queries require a structured representation of the query (i.e., with both intents and entities) and a structured knowledge base to be able to produce the required answer. For instance, to answer the question “Are there any restaurants serving vegan meals near gate 79?”, we need a knowledge base containing all restaurants, their location, and the foods they serve, as well as an algorithm capable of calculating a distance between two locations.
Many real frequent user queries require access to back-end transactional systems (e.g., “What is the arrival time of flight UA789?”). Again, this cannot be implemented with a static FAQ knowledge base.

The approach we recommend for building a question answering system with Dialogflow is consistent with what Google actually recommends, that is, use Knowledge Connectors to complement defined intents. More specifically, use the power of defined intents, leveraging entities and lots of training phrases, to achieve a high success level on the really frequent questions (the short tail).

Then, for those long tail questions that cannot be answered this way, use knowledge connectors with whatever knowledge bases are available to propose possible answers that the user will hopefully find relevant.
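
In code, this recommendation boils down to a two-stage lookup: try the defined intents first, and only fall back to the knowledge base when no intent matches with enough confidence. The sketch below uses stand-in functions rather than real Dialogflow calls; the function names, thresholds, and sample data are all assumptions.

```python
def answer(query, match_intent, search_knowledge,
           intent_threshold=0.7, knowledge_threshold=0.3, max_suggestions=3):
    """Short tail first (defined intents), long tail second (knowledge base search).

    match_intent(query)     -> (answer, confidence) or None      (hypothetical callable)
    search_knowledge(query) -> list of (answer, confidence)      (hypothetical callable)
    """
    intent = match_intent(query)
    if intent is not None and intent[1] >= intent_threshold:
        return {"source": "intent", "answers": [intent[0]]}
    candidates = [text for text, score in search_knowledge(query) if score >= knowledge_threshold]
    if candidates:
        return {"source": "knowledge", "answers": candidates[:max_suggestions]}
    return {"source": "fallback", "answers": []}


# Stand-ins for the two engines, for illustration only
def fake_intent(query):
    return ("Yes, strollers are allowed.", 0.9) if "stroller" in query else None


def fake_search(query):
    return [("Article A", 0.5), ("Article B", 0.2)] if "chapel" in query else []


print(answer("are strollers allowed?", fake_intent, fake_search))
print(answer("where is the chapel?", fake_intent, fake_search))
```

The knowledge base results are presented as suggestions rather than as a single authoritative answer, which matches the lower precision we observed for knowledge connectors.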

Thanks to Guillaume Voisine and Mathieu Bergeron for doing much of the experimental work and for their invaluable help writing this blog.

The post FAQ: experiments with Dialogflow knowledge connectors first appeared on AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo.

Virtual assistants, conversational agents, virtual agents, interactive voice response systems: how to make sense of it all?

Yves Normandin — Wed, 11 Dec 2019 18:00:09 +0000

In the past few years, we have witnessed the introduction of a bunch of new terms and expressions related to conversational systems and interfaces: chatbots, voicebots, intelligent virtual agents (IVAs), intelligent virtual assistants (IVAs), etc. Unfortunately, all of these tend to mean different things to different people, which ends up generating a lot of confusion in the industry.

In an attempt to, if not eliminate, at least reduce some of that confusion, I’ll propose some broad definitions for these terms.

A chatbot is an automated system with which users interact through a “chat-like” interface. This includes messaging channels such as Messenger, WhatsApp, Slack… but it also includes SMS, iMessage, as well as other chat-like interfaces such as web chats, chat widgets in mobile applications, etc. Although chatbot interactions should primarily be done through text input and output, they in practice increasingly incorporate rich media (depending on what the channel supports) such as buttons, images, carousels, webviews, etc. In reality, many chatbots have little or no support for text input, relying primarily on buttons for user input. A chatbot is not necessarily conversational (see here for an explanation of what we mean by conversational) and in fact most chatbots are highly directed, menu driven “dialogs”.

A voicebot is a chatbot with which users can interact vocally. This assumes that the chatbot behind the voicebot can handle natural language input and it requires a capability to convert voice input into text (or directly into intents), as well as text output into voice. Example voicebots include any bots accessible through a voice channel, which include the now ubiquitous smart home speakers, but also the plain old telephone channel as well as any VoIP channel, for instance the call channels of Skype, Messenger, WhatsApp, Slack, etc. In that sense, a conversational IVR could be seen as a voicebot. Another example would be a Dialogflow voicebot, accessible through any voice channel, that takes advantage of Dialogflow’s ability to detect intent from audio.

An Intelligent Virtual Agent (IVA) is a robot that simulates an agent (which, in this context, really means a contact center agent). It provides some of the services normally provided by a contact center agent through a communication with users – via voice or text channels – that resembles human-to-human communication. For reference, DMG defines an IVA as “A system that utilizes artificial intelligence, machine learning, advanced speech technologies (including NLU/NLP/NLG) to simulate live and unstructured cognitive conversations for voice, text, or digital interactions via a digital persona.” A virtual agent can hence be a chatbot, a voicebot, or both.

An Intelligent Virtual Assistant (also IVA, unfortunately) is a system that is dedicated to helping its user, either by providing useful information or advice (weather or traffic information, financial advice, etc.), by answering questions, or by accomplishing tasks on the user’s behalf (e.g., planning meetings, booking hotels, paying bills). Interaction with an intelligent virtual assistant is often done through text or voice conversational channels, which effectively makes it a chatbot or a voicebot, but it can also be done through mobile or web applications.

An IVR (Interactive Voice Response) is an interactive telephone system that is primarily used in a call center to steer calls to the appropriate agent, and possibly to enable callers to perform some self-service transactions. Most IVR systems today are anything but conversational, relying instead primarily on menu navigation through DTMF (touch-tone) user inputs. Several IVR systems also enable speech input, but most of these only support voice menus and directed dialogs. More recently, natural language call steering applications, which enable callers to state the purpose of their call in their own words, have gained in popularity, but that remains a very small minority of IVR systems out there. The surge in popularity of conversational systems, however, is inevitably now impacting IVR, so expect to see a rapidly increasing number of IVR voicebots being deployed in the near future.

The post Virtual assistants, conversational agents, virtual agents, interactive voice response systems: how to make sense of it all? first appeared on AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo.

Does your FAQ really answer questions?

Yves Normandin — Mon, 25 Nov 2019 18:00:43 +0000

From FAQs to chatbots: Improve customer experience with conversational question answering.

A significant portion of customer service inquiries is about users wanting an answer to a question. Organizations are rightfully motivated to provide efficient means for users to find answers to their questions autonomously (i.e., without interacting with a human agent) since it can improve user experience while greatly reducing costs by freeing up valuable time for their customer service agents.

To achieve this, organizations traditionally propose a Frequently Asked Questions (FAQ) section on their website and they also often provide a search capability that can return relevant articles from a knowledge base, the website, or both. In many cases, these can provide fairly effective means for users to get the information they’re looking for, therefore reducing pressure on the contact center.

In that context, how can a question-answering customer service chatbot add value? Certainly, that cannot be by providing a chat-like interface to a static FAQ or to an existing website search capability. That just wouldn’t be very compelling (for a discussion on this topic, see Tobias Goebel’s great blog post explaining why we can’t just convert FAQs into a chatbot 1:1).

Chatbot question answering: beyond static FAQs and search

In order to really provide question answering value, a customer service chatbot has to go beyond the FAQ capabilities already provided on the website. This can be achieved in a number of ways, including by:

Directly answering users’ questions rather than providing links to relevant documents. If I ask “Are strollers allowed on airplanes?” I’d like to have a clear response (“Yes, strollers are allowed.”) rather than a list of articles that may or may not answer my question.
Truly leveraging a conversational interface, for instance by enabling the chatbot to clarify vague questions:

User: I’m looking for a telephone number
Chatbot: Who would you like to call?
User: Lost items
Chatbot: The lost-and-found telephone number is 123-456-7890

Or by enabling users to ask follow-on questions:

User: Can I bring breast milk on a plane?
Chatbot: Yes, breast milk is allowed on airplanes.
User: What about strollers?
Chatbot: Strollers are also allowed.

Providing dynamic and/or personalized answers, which require access to back-end systems. For instance:

What is the arrival time for flight United 285?
When should I expect to receive my luggage?

Enabling question answering at any time during the course of a chatbot conversation.
Giving users the ability to continue the conversation with a human agent, if the chatbot isn’t able to solve the user’s issue.

In a chatbot, the very frequent queries (the short tail) can – and should – be handled using standard approaches (e.g., with intents and entities). While that requires work to maintain the chatbot to handle those new frequent queries that will inevitably occur, it’s the approach that will provide the best results.

Meanwhile, however, there will always be all those long tail queries that would just require too much effort to try to support that way. So when the chatbot doesn’t have the answer to a question, it is best to fall back to a search-like mode that can automatically leverage all those documents and knowledge bases that you already have. They most likely contain answers to many of these questions. This not only reduces development effort, but it makes it much easier to keep the system up to date with the latest answers.

Search-like capabilities in conversational platforms

Some conversational platforms provide search-like capabilities that make it possible to automatically leverage existing knowledge bases or documents to search for answers to those user queries that the chatbot cannot answer. For instance:

Chatbots developed with Watson Assistant can leverage Watson Discovery for that purpose. Performance can be improved by using Watson Knowledge Studio to teach Watson about the language and relationships that are useful in order to understand your specific domain or industry.
Chatbots developed with Google Dialogflow can leverage Dialogflow’s Knowledge Connectors to search knowledge bases for a response to a user query. Knowledge connectors are offered in two varieties: FAQs and knowledge base articles. FAQs are used to integrate existing Frequently Asked Questions (e.g., from a website). In that case, finding a response means finding the FAQ question-answer pairs (QA pairs) that best match the user query. With knowledge base articles, Dialogflow actually looks for the answer to user queries within the articles and returns the most relevant portion of the article as the answer.

In future blog posts, we will report on experiments with some of these platforms. Stay tuned.

The post Does your FAQ really answer questions? first appeared on AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo.

SpeechTEK 2019: how conversational AI is making its way onto the telephone channel

Yves Normandin — Fri, 03 May 2019 18:00:29 +0000

Conversational AI was clearly one of the biggest themes this year at SpeechTEK (Apr 29 – May 1, 2019, Washington DC). And SpeechTEK being a speech technology conference, the emphasis was naturally on voice, rather than text conversations. Conversational AI is first and foremost about Intelligent Virtual Agents/Assistants (IVAs), which are robots that provide services through a user interface that simulates human-to-human communication.

Of course, there is nothing really new about all this. Chatbots have been all the rage for over three years. What was really new this year at SpeechTEK was this unmistakable feeling that conversational AI over the phone had suddenly become mainstream. Clues were everywhere, starting with the strong participation of Google Cloud, Twilio, and Gridspace, all Diamond Sponsors at the conference. Both Twilio and Google Cloud had keynote sessions, with conversational AI as the main topic. But the strongest hints of all came from casual conversations with attendees, who by and large saw telephone IVAs as inevitable and were looking for the best solutions to turn this into a reality.

Telephone calls are not going away

Maybe this has to do with the fact that call volumes into contact centers aren’t going down, while customer expectations are going up as a result of their experience with personal assistants like Siri and Google Assistant, as well as smart speakers like Amazon Echo and Google Home. In that context, companies cannot continue to ignore their old IVR system, which is increasingly becoming the worst part of their customers’ journey with them.

The challenge is how to provide that great conversational experience over the telephone. In the chatbot world, the great majority of developers realized long ago that it’s hard to make natural language understanding (NLU) technology work well enough to provide a great user experience, and they have mostly resorted to very directed, menu-based interactions, limiting the use of NLU to where it is absolutely necessary. And, by and large, that works quite well for many use cases. But in the IVR world, that’s not an option because directed, menu-based interactions are what most companies offer today and users simply hate them.

Accuracy is key…

In their talks, Google Cloud rightly insisted on the importance of speech-to-text (STT) and NLU accuracy. Last year, Google Cloud introduced enhanced acoustic models for the telephone, which cut error rates in half, and I’m sure they will continue to improve accuracy. I also expect STT vendors to eventually introduce features that let developers further improve accuracy by telling the engine what types of responses are most likely. Even if, in principle, users can say anything, the odds are high that if the bot is asking a question, the user will respond to that question rather than say something totally unrelated. Being able to give the STT engine indications about what users are likely to say can make a huge difference in accuracy. Speech technology vendors like Nuance have long known this, and they make sure that developers have this kind of control.
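To make the idea concrete, here is a minimal Python sketch of what "telling the engine what to expect" can buy you. It is only an illustration under stated assumptions: real engines typically apply such hints during decoding, whereas this sketch simply rescores an N-best list after the fact. The `Hypothesis` class, the `rescore` function, the `boost` value, and the sample scores are all hypothetical and do not correspond to any vendor's API.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One entry of a (hypothetical) N-best list returned by an STT engine."""
    text: str
    score: float  # engine confidence, assumed to lie in [0, 1]


def rescore(hypotheses, expected_phrases, boost=0.2):
    """Boost hypotheses that contain an expected phrase, then sort best-first.

    This is post-hoc rescoring, not true decoder biasing; the boost value
    is an arbitrary assumption chosen for the example.
    """
    expected = [phrase.lower() for phrase in expected_phrases]
    rescored = []
    for hyp in hypotheses:
        matches = any(phrase in hyp.text.lower() for phrase in expected)
        bonus = boost if matches else 0.0
        rescored.append(Hypothesis(hyp.text, min(1.0, hyp.score + bonus)))
    return sorted(rescored, key=lambda hyp: hyp.score, reverse=True)


# The bot has just asked: "Checking or savings?"
n_best = [
    Hypothesis("czech in", 0.48),
    Hypothesis("checking", 0.45),
    Hypothesis("chicken", 0.30),
]
best = rescore(n_best, ["checking", "savings"])[0]
print(best.text)
```

Without the hint, the top hypothesis is the slightly higher-scoring misrecognition; with it, the answer that actually fits the question moves to the top.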

…but conversational user experience design is critical

Another, related topic that was discussed at length at SpeechTEK is the importance of conversational user experience (CUX) design expertise and the lack of such expertise in the market. This has been an issue for chatbots, but to a certain extent companies have been able to work around it by leveraging rich media features available on messaging channels. On the telephone, where the interaction is primarily through a voice conversation, CUX expertise is critical. Simply managing a conversation that is natural and productive for the user requires strong CUX skills. But the challenge is even greater when the bot always has to deal with uncertain STT and NLU inputs and therefore has to use efficient repair dialogues to deal with this uncertainty. There was, by the way, a very interesting presentation on this topic at the conference by Bruce Balentine.
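One common way to handle that uncertainty is to let the confidence of the STT/NLU result decide how the bot responds. The Python sketch below is a simplified illustration, not a description of any particular platform or of the approach presented at the conference: the `next_action` function, the threshold values, and the intent name are assumptions made for the example.

```python
def next_action(intent, confidence, accept_at=0.8, confirm_at=0.5):
    """Pick a dialogue move from an NLU result and its confidence.

    Thresholds are illustrative assumptions; in practice they would be
    tuned on real call data.
    """
    if intent is None or confidence < confirm_at:
        # Nothing usable was understood: ask again, ideally rephrased.
        return "reprompt"
    if confidence < accept_at:
        # Plausible but uncertain: confirm before acting on it.
        return "confirm"
    # Confident enough to proceed without bothering the caller.
    return "accept"


for intent, confidence in [("pay_bill", 0.92), ("pay_bill", 0.63), (None, 0.0)]:
    print(intent, confidence, next_action(intent, confidence))
```

The design work lies less in the thresholds than in what the bot actually says for each move, so that confirmations and reprompts feel like a natural part of the conversation.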

Beyond IVAs

Finally, beyond IVAs, there was also a lot of discussion at SpeechTEK on other dimensions of conversational AI, namely, agent assistants and speech analytics. Agent assistants provide real-time guidance to agents based on the analysis of the ongoing conversation between the agent and the customer. This was one of the topics discussed by Google Cloud during their keynote (part of their Contact Center AI offering), but other vendors also presented solutions in that space, namely ttec Associate Assist and Gridspace Relay.

So, all in all, a very interesting conference, from my perspective. Let me know if you detected other important trends at the conference.

The post SpeechTEK 2019: how conversational AI is making its way onto the telephone channel (article in English) first appeared on AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo.
