<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Yves Normandin - AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</title>
	<atom:link href="https://www.nuecho.com/author/ynormandin/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.nuecho.com/author/ynormandin/</link>
	<description>Nu Echo</description>
	<lastBuildDate>Mon, 20 Sep 2021 15:26:03 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://www.nuecho.com/wp-content/uploads/2019/11/cropped-favicon-32x32.png</url>
	<title>Yves Normandin - AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</title>
	<link>https://www.nuecho.com/author/ynormandin/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>There is a new IVR in town. Here’s what it means</title>
		<link>https://www.nuecho.com/there-is-a-new-ivr-in-town-heres-what-it-means/#utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=there-is-a-new-ivr-in-town-heres-what-it-means</link>
		
		<dc:creator><![CDATA[Yves Normandin]]></dc:creator>
		<pubDate>Wed, 05 May 2021 14:07:25 +0000</pubDate>
				<category><![CDATA[Blog]]></category>
		<guid isPermaLink="false">https://zux.zsm.mybluehost.me/?p=8454</guid>

					<description><![CDATA[<p>And that’s nothing new, really. We used to call that “speech recognition IVR” and we&#8217;ve been delivering these conversational experiences for 20 years. What is new is that there are now novel technologies and platforms that promise to make it much faster and easier to create these conversational experiences while greatly expanding the range of [&#8230;]</p>
<p>The post <a href="https://www.nuecho.com/there-is-a-new-ivr-in-town-heres-what-it-means/">There is a new IVR in town. Here’s what it means</a> first appeared on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
<p>The post <a href="https://www.nuecho.com/there-is-a-new-ivr-in-town-heres-what-it-means/">There is a new IVR in town. Here’s what it means</a> appeared first on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><span style="font-weight: 400;">And that’s nothing new, really. We used to call that “speech recognition IVR” and we&#8217;ve been delivering these conversational experiences for 20 years.</span></p>
<p><span style="font-weight: 400;">What </span><i><span style="font-weight: 400;">is</span></i><span style="font-weight: 400;"> new is that there are now novel technologies and platforms that promise to make it much faster and easier to create these conversational experiences while greatly expanding the range of tasks that virtual voice agents (that’s what we call them) can handle.</span></p>
<p><span style="font-weight: 400;">These novel technologies initially emerged in the context of voice assistants (Siri, Amazon Echo, Google Home) and are in the process of fundamentally changing the way IVR solutions are developed.</span></p>
<p><span style="font-weight: 400;">To understand how, let’s compare the “Traditional speech recognition IVR” with the “New IVR”.</span></p>
<table class="MsoNormalTable" style="border-collapse: collapse; border: none; mso-border-alt: solid black 1.0pt; mso-yfti-tbllook: 1184; mso-border-insideh: 1.0pt solid black; mso-border-insidev: 1.0pt solid black;" border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr style="mso-yfti-irow: 0; mso-yfti-firstrow: yes;">
<td style="border: solid black 1.0pt; background: #EFEFEF; padding: 5.0pt 5.0pt 5.0pt 5.0pt;" valign="top">
<p class="MsoNormal" style="margin-bottom: 6.0pt; line-height: normal;"><b><br />
<span style="font-family: 'Arial',sans-serif; mso-fareast-font-family: 'Times New Roman'; color: black; mso-fareast-language: FR-CA;">Technology</span></b></p>
</td>
<td style="border: solid black 1.0pt; border-left: none; mso-border-left-alt: solid black 1.0pt; background: #EFEFEF; padding: 5.0pt 5.0pt 5.0pt 5.0pt;" valign="top">
<p class="MsoNormal" style="margin-bottom: 6.0pt; line-height: normal;"><b><br />
<span style="font-family: 'Arial',sans-serif; mso-fareast-font-family: 'Times New Roman'; color: black; mso-fareast-language: FR-CA;">Traditional speech IVR</span></b></p>
</td>
<td style="border: solid black 1.0pt; border-left: none; mso-border-left-alt: solid black 1.0pt; background: #EFEFEF; padding: 5.0pt 5.0pt 5.0pt 5.0pt;" valign="top">
<p class="MsoNormal" style="margin-bottom: 6.0pt; line-height: normal;"><b><br />
<span style="font-family: 'Arial',sans-serif; mso-fareast-font-family: 'Times New Roman'; color: black; mso-fareast-language: FR-CA;">New IVR</span></b></p>
</td>
</tr>
<tr style="mso-yfti-irow: 1;">
<td style="border: solid black 1.0pt; border-top: none; mso-border-top-alt: solid black 1.0pt; padding: 5.0pt 5.0pt 5.0pt 5.0pt;" valign="top">
<p class="MsoNormal" style="margin-bottom: 6.0pt; line-height: normal;"><span style="font-family: 'Arial',sans-serif; mso-fareast-font-family: 'Times New Roman'; color: black; mso-fareast-language: FR-CA;">Speech recognition</span></p>
</td>
<td style="border-top: none; border-left: none; border-bottom: solid black 1.0pt; border-right: solid black 1.0pt; mso-border-top-alt: solid black 1.0pt; mso-border-left-alt: solid black 1.0pt; padding: 5.0pt 5.0pt 5.0pt 5.0pt;" valign="top">
<p class="MsoNormal" style="margin-bottom: 6.0pt; line-height: normal;"><span lang="EN-CA" style="font-family: 'Arial',sans-serif; mso-fareast-font-family: 'Times New Roman'; color: black; mso-ansi-language: EN-CA; mso-fareast-language: FR-CA;">Grammars and statistical language models</span></p>
</td>
<td style="border-top: none; border-left: none; border-bottom: solid black 1.0pt; border-right: solid black 1.0pt; mso-border-top-alt: solid black 1.0pt; mso-border-left-alt: solid black 1.0pt; padding: 5.0pt 5.0pt 5.0pt 5.0pt;" valign="top">
<p class="MsoNormal" style="margin-bottom: 6.0pt; line-height: normal;"><span style="font-family: 'Arial',sans-serif; mso-fareast-font-family: 'Times New Roman'; color: black; mso-fareast-language: FR-CA;">Speech-to-text</span></p>
</td>
</tr>
<tr style="mso-yfti-irow: 2;">
<td style="border: solid black 1.0pt; border-top: none; mso-border-top-alt: solid black 1.0pt; padding: 5.0pt 5.0pt 5.0pt 5.0pt;" valign="top">
<p class="MsoNormal" style="margin-bottom: 6.0pt; line-height: normal;"><span style="font-family: 'Arial',sans-serif; mso-fareast-font-family: 'Times New Roman'; color: black; mso-fareast-language: FR-CA;">Natural language understanding (NLU)</span></p>
</td>
<td style="border-top: none; border-left: none; border-bottom: solid black 1.0pt; border-right: solid black 1.0pt; mso-border-top-alt: solid black 1.0pt; mso-border-left-alt: solid black 1.0pt; padding: 5.0pt 5.0pt 5.0pt 5.0pt;" valign="top">
<p class="MsoNormal" style="margin-bottom: 6.0pt; line-height: normal;"><span style="font-family: 'Arial',sans-serif; mso-fareast-font-family: 'Times New Roman'; color: black; mso-fareast-language: FR-CA;">Grammars and simple classifiers</span></p>
</td>
<td style="border-top: none; border-left: none; border-bottom: solid black 1.0pt; border-right: solid black 1.0pt; mso-border-top-alt: solid black 1.0pt; mso-border-left-alt: solid black 1.0pt; padding: 5.0pt 5.0pt 5.0pt 5.0pt;" valign="top">
<p class="MsoNormal" style="margin-bottom: 6.0pt; line-height: normal;"><span style="font-family: 'Arial',sans-serif; mso-fareast-font-family: 'Times New Roman'; color: black; mso-fareast-language: FR-CA;">Deep learning NLP</span></p>
</td>
</tr>
<tr style="mso-yfti-irow: 3; mso-yfti-lastrow: yes;">
<td style="border: solid black 1.0pt; border-top: none; mso-border-top-alt: solid black 1.0pt; padding: 5.0pt 5.0pt 5.0pt 5.0pt;" valign="top">
<p class="MsoNormal" style="margin-bottom: 6.0pt; line-height: normal;"><span style="font-family: 'Arial',sans-serif; mso-fareast-font-family: 'Times New Roman'; color: black; mso-fareast-language: FR-CA;">Speech generation</span></p>
</td>
<td style="border-top: none; border-left: none; border-bottom: solid black 1.0pt; border-right: solid black 1.0pt; mso-border-top-alt: solid black 1.0pt; mso-border-left-alt: solid black 1.0pt; padding: 5.0pt 5.0pt 5.0pt 5.0pt;" valign="top">
<p class="MsoNormal" style="margin-bottom: 6.0pt; line-height: normal;"><span lang="EN-CA" style="font-family: 'Arial',sans-serif; mso-fareast-font-family: 'Times New Roman'; color: black; mso-ansi-language: EN-CA; mso-fareast-language: FR-CA;">Prompt concatenation + some text-to-speech</span></p>
</td>
<td style="border-top: none; border-left: none; border-bottom: solid black 1.0pt; border-right: solid black 1.0pt; mso-border-top-alt: solid black 1.0pt; mso-border-left-alt: solid black 1.0pt; padding: 5.0pt 5.0pt 5.0pt 5.0pt;" valign="top">
<p class="MsoNormal" style="margin-bottom: 6.0pt; line-height: normal;"><span style="font-family: 'Arial',sans-serif; mso-fareast-font-family: 'Times New Roman'; color: black; mso-fareast-language: FR-CA;">Mostly text-to-speech</span></p>
</td>
</tr>
</tbody>
</table>
<p><span style="font-weight: 400;">Let’s review the above in greater detail.</span></p>
<h2><span style="font-weight: 400;">Traditional IVR speech recognition</span></h2>
<p><span style="font-weight: 400;">The speech recognition engines traditionally used in speech IVR (e.g., </span><a href="https://www.nuance.com/omni-channel-customer-engagement/voice-and-ivr/automatic-speech-recognition/nuance-recognizer.html" target="_blank" rel="noopener"><span style="font-weight: 400;">Nuance Recognizer</span></a><span style="font-weight: 400;">) can’t recognize speech “out-of-the-box”. To recognize speech, they need a speech recognition grammar. There are two main types of grammars:</span></p>
<ol>
<li style="font-weight: 400;" aria-level="1"><b>SRGS grammars</b><span style="font-weight: 400;"> are </span><a href="https://www.w3.org/TR/speech-grammar/" target="_blank" rel="noopener"><span style="font-weight: 400;">defined by a set of rules</span></a><span style="font-weight: 400;">, hand-crafted by a grammar developer, which provide a formal definition of the language that can be recognized by the engine. The language defined by SRGS grammars is rigid and only the sentences included in that language can be recognized by the engine. This makes them well suited for directed dialogues, which tend to have a predictable range of user utterances.</span></li>
<li style="font-weight: 400;" aria-level="1"><b>Statistical language models (SLMs)</b><span style="font-weight: 400;"> are defined by </span><a href="https://en.wikipedia.org/wiki/N-gram" target="_blank" rel="noopener"><span style="font-weight: 400;">N-grams</span></a><span style="font-weight: 400;">, which are probabilities of a word given the previous words in the sentence and these probabilities are learned from sample sentences. SLMs provide a much less rigid language model than SRGS grammars so they are better suited for handling spontaneous natural language responses to open-ended prompts (e.g., “How may I help you?”). To perform well, SLMs need a sufficiently large and representative corpus of sentences with which to train the model.</span></li>
</ol>
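<p>To make the SRGS point concrete, here is a minimal, hand-written grammar sketch for a yes/no question; the rule name and phrase list are illustrative, not taken from a production grammar. Only utterances this grammar enumerates can be recognized, which is precisely the rigidity described above:</p>
<pre><code>&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" root="yes_no"&gt;
  &lt;!-- Every recognizable sentence must be spelled out by a rule. --&gt;
  &lt;rule id="yes_no" scope="public"&gt;
    &lt;one-of&gt;
      &lt;item&gt;yes&lt;/item&gt;
      &lt;item&gt;yeah&lt;/item&gt;
      &lt;item&gt;correct&lt;/item&gt;
      &lt;item&gt;no&lt;/item&gt;
      &lt;item&gt;nope&lt;/item&gt;
    &lt;/one-of&gt;
  &lt;/rule&gt;
&lt;/grammar&gt;</code></pre>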
<p><span style="font-weight: 400;">Developing a traditional speech IVR application typically requires creating a separate grammar for each step in the dialogue. Moreover, to achieve sufficient recognition accuracy, these grammars need to be extensively tuned based on real user utterances collected by the application in production.</span></p>
<p><span style="font-weight: 400;">Developing and tuning these grammars are time consuming tasks that require highly skilled speech scientists. If done well, this can produce very high accuracy and good user experiences. Unfortunately, the tendency to cut corners there is very high, which inevitably results in poor performance and user experience. This is one of the main reasons why speech recognition IVR tends to have such a bad reputation.</span></p>
<h2><span style="font-weight: 400;">Speech-to-text (STT)</span></h2>
<p><span style="font-weight: 400;">In the past several years, we have witnessed breakthrough improvements in speech recognition technology thanks to deep learning. This has made it possible to train speech-to-text (STT) engines that produce high accuracy speech transcription with almost unlimited vocabularies. Nowadays, STT engines are available from a wide range of vendors (e.g., </span><a href="https://cloud.google.com/speech-to-text" target="_blank" rel="noopener"><span style="font-weight: 400;">Google STT</span></a><span style="font-weight: 400;">, </span><a href="https://docs.mix.nuance.com/asr-grpc/v1/#asr-as-a-service-grpc-api" target="_blank" rel="noopener"><span style="font-weight: 400;">Nuance Krypton</span></a><span style="font-weight: 400;">, </span><a href="https://aws.amazon.com/transcribe/" target="_blank" rel="noopener"><span style="font-weight: 400;">Amazon Transcribe</span></a><span style="font-weight: 400;">, </span><a href="https://deepgram.com/" target="_blank" rel="noopener"><span style="font-weight: 400;">Deepgram</span></a><span style="font-weight: 400;">, etc.) and there are even </span><a href="https://fosspost.org/open-source-speech-recognition/" target="_blank" rel="noopener"><span style="font-weight: 400;">open-source versions available</span></a><span style="font-weight: 400;">.</span></p>
<p><span style="font-weight: 400;">With STT engines, there is no need to develop grammars at all, so this is a huge time saver when creating conversational IVR applications. This is not to say that speech recognition is a solved problem, far from it. Accuracy remains very much an issue. In fact, we can often achieve significantly better accuracy with well-tuned grammars than with even the best STT engines.</span></p>
<p><span style="font-weight: 400;">At the moment, the main issues with STT engines are:</span></p>
<ol>
<li style="font-weight: 400;" aria-level="1"><b>Training data</b><span style="font-weight: 400;">. As with any model based on machine learning, the STT model’s performance will be best if its training data is representative of conditions in which it is used. So for instance, if a model was mostly trained on recordings from home speakers primarily involving topics like music playing, weather information, alarms setting and general trivia questions, it may not be optimal for a banking IVR application. Having the ability to fine-tune a STT model on domain-specific data could clearly make a huge difference in accuracy. Unfortunately, most commercial STT vendors don’t make that possible (one notable exception being Deepgram). Nuance does provide a partial solution by making it possible to train a Domain Language Model (DLM) on phrases specific to the target domain. </span></li>
<li style="font-weight: 400;" aria-level="1"><b>Contextualization</b><span style="font-weight: 400;">. STT engines can conceptually recognize any user utterance, whether it’s about movies, politics, weather, music, or whatever. That’s very powerful but that’s also a liability in conversational applications, which are usually both domain-specific and highly contextual. If the virtual agent asks a user for a birthdate, then there’s a fairly good chance that the user will respond with a birthdate. The ability to take advantage of such contextual knowledge can greatly improve speech recognition accuracy. Humans do this all the time without even realizing it. Some STT engines do provide some contextualization capabilities (e.g., </span><a href="https://cloud.google.com/speech-to-text/docs/adaptation-model" target="_blank" rel="noopener"><span style="font-weight: 400;">Google STT model adaptation</span></a><span style="font-weight: 400;">), but these remain quite limited at the moment.</span></li>
<li style="font-weight: 400;" aria-level="1"><b>Optimization</b><span style="font-weight: 400;">. Traditional IVR speech recognition engines provide several effective ways to optimize accuracy. For example, big accuracy gains can be achieved by fine-tuning phonetic transcriptions, modeling intra and inter-word coarticulation, modeling disfluencies, tuning grammar weights, post-processing N-best results, etc. Most STT engines provide few, if any means to optimize accuracy.</span></li>
<li style="font-weight: 400;" aria-level="1"><b>Multilingual support</b>. Nu Echo is based in bilingual Montreal and most conversational applications we deploy need to support English words in French sentences and vice-versa (address recognition is a very good example). That can only be done effectively with a speech recognition engine capable of supporting two languages in the same utterance, a feature available in some traditional IVR speech recognition engines, but in no STT engine we know of.</li>
</ol>
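<p>As a concrete illustration of the contextualization point above, here is a minimal sketch that biases Google STT toward date-like answers at a “birthdate” dialogue step, using the speech adaptation feature mentioned earlier. The phrase hints, boost value, and audio location are illustrative assumptions, not tuned or real values:</p>
<pre><code># Minimal sketch: biasing Google STT toward the current dialogue context.
# Assumes the google-cloud-speech client library and default credentials.
from google.cloud import speech

client = speech.SpeechClient()

# Hint the recognizer that a date is the likely answer at this step.
birthdate_context = speech.SpeechContext(
    phrases=["January", "February", "March", "nineteen eighty", "first", "second"],
    boost=10.0,  # illustrative value; should be tuned against real traffic
)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,  # typical telephony audio
    language_code="en-US",
    speech_contexts=[birthdate_context],
)

# Hypothetical audio location; in a live IVR the audio would be streamed.
audio = speech.RecognitionAudio(uri="gs://my-bucket/utterance.wav")
response = client.recognize(config=config, audio=audio)

for result in response.results:
    best = result.alternatives[0]
    print(best.transcript, best.confidence)</code></pre>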
<p><span style="font-weight: 400;">STT technologies are evolving extremely rapidly so we can expect continuously improving accuracy, more effective contextualization and optimization tools, as well as better access to domain-optimized models. In the meantime, the optimal solution may very well be a combination of STT and traditional IVR engines.</span></p>
<h2><span style="font-weight: 400;">Natural language understanding (NLU)</span></h2>
<p><span style="font-weight: 400;">Early speech IVR applications relied exclusively on SRGS grammars for speech recognition, so NLU was not an issue since NLU is built into the grammar.</span></p>
<p><span style="font-weight: 400;">The use of statistical language models (SLMs) created the need for a separate NLU engine, capable of understanding free-form speech recognition results. Intent detection techniques based on simple machine learning techniques were </span><a href="http://www.aclweb.org/anthology/J99-3003.pdf" target="_blank" rel="noopener"><span style="font-weight: 400;">introduced more than 20 years ago</span></a><span style="font-weight: 400;"> for the purpose of natural language call routing. These techniques have worked quite well, but they typically require a large number of sample sentences per intent to adequately train the model, which is often a big obstacle to get a system up and running.</span></p>
<p><span style="font-weight: 400;">For a very long time, these techniques didn&#8217;t evolve much. Then, deep learning totally changed the landscape for natural language processing technologies. A first impact has been the introduction of word embeddings, which improve generalizability and make it possible to greatly reduce the number of sample sentences required to train NLU models. More recently, large language models (e.g., BERT) and new neural network architectures are providing further improvements.</span></p>
<p><span style="font-weight: 400;">Note that, although the same NLU technologies are used for both text and voice conversations, there are important differences. For instance, while text conversational systems must to be able to robustly deal with typos, initialisms (eg, “lol”), emoticons, etc., voice conversational systems have to deal with homophone spelling differences (e.g., “coming” vs “cumming”, “forestcrest” vs “forest crest”, or “our our 9” vs “rr9”), undesired normalizations by the STT engine (e.g., “rr1 third concession” → “rr⅓ concession”) and, of course, speech recognition errors.</span></p>
<p><span style="font-weight: 400;">Some issues with NLU engines include:</span></p>
<ol style="margin-left: 80;">
<li style="font-weight: 400;" aria-level="1"><b>Contextualization</b><span style="font-weight: 400;">. Most NLU engines are not contextual (one exception being </span><a href="https://cloud.google.com/dialogflow/cx/docs" target="_blank" rel="noopener"><span style="font-weight: 400;">Dialogflow</span></a><span style="font-weight: 400;">), which can be a problem since the same utterance can have different interpretations depending on the context. For instance, the meaning of “Montreal” is different depending whether the question was “what’s your destination?” or “what’s the departure city?”</span></li>
<li style="font-weight: 400;" aria-level="1"><b>Confidence scoring</b><span style="font-weight: 400;">. Effective repair dialogue requires dependable confidence scores and, unfortunately, NLU confidence scores tend not to be very good. Moreover, NLU scores usually don’t take the speech recognition confidence score into account, which is a big problem since how can we be confident in a NLU result if it’s based on a low confidence speech recognition result? In voice conversational application, effective confidence scores need to take both the STT and the NLU scores into account.</span></li>
<li style="font-weight: 400;" aria-level="1"><b>N-best results</b><span style="font-weight: 400;">. Many NLU engines only return the best scoring intent, even when several intents have almost identical scores. Having access to N-best results makes it possible to make better dialogue decisions (e.g., disambiguation) or to choose the best hypothesis based on contextual information not available to the NLU engine.</span></li>
</ol>
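<p>To make the confidence-scoring point concrete, here is a minimal sketch of one way to combine the two scores, assuming both engines return a score in [0, 1]. The geometric mean and the thresholds are illustrative choices, not recommended values; real systems often learn this combination from data:</p>
<pre><code># Illustrative sketch: combining STT and NLU confidence for repair decisions.
import math

def combined_confidence(stt_conf: float, nlu_conf: float) -> float:
    # The geometric mean penalizes a low score on either side, so a
    # confident NLU result cannot mask a shaky recognition result.
    return math.sqrt(stt_conf * nlu_conf)

def repair_action(stt_conf: float, nlu_conf: float) -> str:
    score = combined_confidence(stt_conf, nlu_conf)
    if score >= 0.80:     # thresholds are illustrative and need tuning
        return "accept"
    if score >= 0.45:
        return "confirm"  # e.g., "Did you say Montreal?"
    return "reprompt"

# High NLU confidence cannot rescue a low-confidence recognition result:
print(repair_action(stt_conf=0.35, nlu_conf=0.95))  # -> "confirm"</code></pre>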
<p><span style="font-weight: 400;">Natural language processing is currently one of the most active areas of research in artificial intelligence and we expect to see a continuous stream of technological advances making their way into conversational AI systems.</span></p>
<h2><span style="font-weight: 400;">Speech generation</span></h2>
<p><span style="font-weight: 400;">Text-to-speech (TTS) engines have been around for a very long time, but up until recently, the quality and intelligibility wasn’t nearly good enough to provide a good conversational experience. The best speech IVR applications therefore relied almost exclusively on prompts recorded in the studio by professional voice talents. Speech generation for sentences incorporating dynamic data was done with prompt concatenation, which is quite difficult to do well.</span></p>
<p><span style="font-weight: 400;">But we’ve recently seen such phenomenal progress in TTS technologies that it now makes sense to use TTS instead of studio recordings in most cases. That’s especially true in English, where the quality of the best TTS is such that it’s sometimes difficult to distinguish it from human speech. Moreover, it is now possible to create custom TTS voices that imitate the voice of our favorite voice talent.</span></p>
<p><span style="font-weight: 400;">The use of TTS technology really is a game changer when it comes to creating and evolving conversational IVR applications since it eliminates the need to constantly go back to the studio to record new prompts any time an application change is required and it avoids all the cumbersome, error-prone manipulations of thousands of voice segments (often in multiple languages). Now, applications can be modified, tested, and released to production almost on-the-fly.</span></p>
<p><span style="font-weight: 400;">Of course, TTS is not perfect and we still see the occasional glitches, but generally that seems like a small price to pay for the immense added value it provides. The best solution may very well be a combination of recorded audio for those key prompts where we want to get the exact intonation and emotion we’re looking for, with a custom TTS voice built from the same voice talent used in recorded prompts.</span></p>
<h2><span style="font-weight: 400;">Integration with contact center platforms</span></h2>
<p><span style="font-weight: 400;">Traditional speech IVR applications have for a long time relied on mature and time-tested standards for integrating conversational technologies. This includes </span><a href="https://tools.ietf.org/html/rfc6787" target="_blank" rel="noopener"><span style="font-weight: 400;">MRCP</span></a><span style="font-weight: 400;"> for speech recognition and text-to-speech, </span><a href="https://www.w3.org/TR/voicexml20/" target="_blank" rel="noopener"><span style="font-weight: 400;">VoiceXML</span></a><span style="font-weight: 400;"> for dialogue, </span><a href="https://www.w3.org/TR/speech-grammar/" target="_blank" rel="noopener"><span style="font-weight: 400;">SRGS</span></a><span style="font-weight: 400;"> for speech recognition grammars, and </span><a href="https://www.w3.org/TR/semantic-interpretation/" target="_blank" rel="noopener"><span style="font-weight: 400;">SISR</span></a><span style="font-weight: 400;"> for semantic interpretation.</span></p>
<p><span style="font-weight: 400;">Now, with the emergence of a new generation of Cloud contact center platforms and the arrival of the latest deep learning based technologies, all of these are being thrown out the window and replaced with a variety of proprietary APIs and some emerging standards (e.g., </span><a href="https://grpc.io/" target="_blank" rel="noopener"><span style="font-weight: 400;">gRPC</span></a><span style="font-weight: 400;">).</span></p>
<p><span style="font-weight: 400;">What this means is that the integration of these new conversational technologies with contact center platforms remains very much a work in progress, so we find that:</span></p>
<ul>
<li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Support for some basic capabilities that we used to take for granted (e.g., barge-in, DTMF support) is not always where it needs to be</span></li>
<li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">The choice of available conversational technologies on many CC platforms remains limited</span></li>
<li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Even when integrations are available, they often make it very difficult to fully take advantage of the technology’s full potential (e.g., no access to some confidence scores or N-best lists, inability to post-process STT results before sending them to the NLU engine, etc.)</span></li>
</ul>
<p><span style="font-weight: 400;">Some solutions are emerging to fill this integration gap. For instance, Audiocodes’s </span><a href="https://voiceaiconnect.audiocodes.com/" target="_blank" rel="noopener"><span style="font-weight: 400;">VoiceAI Connect</span></a><span style="font-weight: 400;"> claims to provide “easy connectivity between any CC platform and any bot frameworks or speech engine”. This could make it possible to leverage the conversational technologies that best fit the requirements of any given solution.</span></p>
<h2><span style="font-weight: 400;">The best of both worlds</span></h2>
<p><span style="font-weight: 400;">Deep learning is fundamentally impacting conversational AI technologies and this is profoundly changing the way we conceive the development of IVR applications. We are still very early in that transformation. These novel technologies are still fairly immature and are likely to evolve rapidly in the near future and so is our understanding of how to most effectively leverage them.</span></p>
<p><span style="font-weight: 400;">Nonetheless, they are already providing some very concrete and transformative benefits. For instance:</span></p>
<ol>
<li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">It is no longer required to create complex grammars or to collect thousands of SLM training utterances to get speech recognition to work. The best speech-to-text engines provide “good enough” speech recognition accuracy out-of-the-box so it is now possible to have a system up and running quickly.</span></li>
<li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">The latest NLU engines can be trained with probably an order of magnitude fewer sample sentences than with older NLU classification technologies, which also makes it possible to get a first version system up and running very quickly.</span></li>
<li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">The latest text-to-speech technologies are getting so good that it is almost no longer necessary to use recorded prompts (especially in English). This is really a game changer since it greatly shortens the time required to create and deliver a new version of the application, therefore greatly facilitating and accelerating the deployment of enhancements.</span></li>
</ol>
<p><span style="font-weight: 400;">The ability to quickly get a first application version up and running is key since it makes it possible to quickly start collecting real conversational data and utterances, which are the raw material with which the system can be continuously enhanced and optimized.</span></p>
<p><span style="font-weight: 400;">While some of the limitations of STT technologies are being addressed (e.g., in terms of contextualization, optimization, multilingual support, etc.), conversational IVR application developers should consider mixing STT with traditional IVR speech recognition technologies in order to get the best of both worlds and deliver exceptional conversational IVR user experiences (some IVR platforms, for instance the </span><a href="https://docs.genesys.com/Documentation/GVP" target="_blank" rel="noopener"><span style="font-weight: 400;">Genesys Voice Platform</span></a><span style="font-weight: 400;">, make that possible).</span></p><p>The post <a href="https://www.nuecho.com/there-is-a-new-ivr-in-town-heres-what-it-means/">There is a new IVR in town. Here’s what it means</a> first appeared on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p><p>The post <a href="https://www.nuecho.com/there-is-a-new-ivr-in-town-heres-what-it-means/">There is a new IVR in town. Here’s what it means</a> appeared first on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Takeaways from the VUX World Live Google Contact Centre AI with Antony Passemard</title>
		<link>https://www.nuecho.com/takeaways-from-the-vux-world-live-google-contact-centre-ai-with-antony-passemard/#utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=takeaways-from-the-vux-world-live-google-contact-centre-ai-with-antony-passemard</link>
		
		<dc:creator><![CDATA[Yves Normandin]]></dc:creator>
		<pubDate>Fri, 19 Mar 2021 19:16:32 +0000</pubDate>
				<category><![CDATA[Blog]]></category>
		<guid isPermaLink="false">https://zux.zsm.mybluehost.me/?p=8274</guid>

					<description><![CDATA[<p>Dialogflow CX vs. ES The interview started with a comparison between Dialogflow CX and ES. CX is not just an incremental improvement over ES. It is in fact a complete redesign, with a more powerful and more intuitive dialog model. It also has a clean separation between intents and dialogue that greatly increases intent reusability [&#8230;]</p>
<p>The post <a href="https://www.nuecho.com/takeaways-from-the-vux-world-live-google-contact-centre-ai-with-antony-passemard/">Takeaways from the VUX World Live Google Contact Centre AI with Antony Passemard</a> first appeared on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
<p>The post <a href="https://www.nuecho.com/takeaways-from-the-vux-world-live-google-contact-centre-ai-with-antony-passemard/">Takeaways from the VUX World Live Google Contact Centre AI with Antony Passemard</a> appeared first on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
]]></description>
										<content:encoded><![CDATA[<h2>Dialogflow CX vs. ES</h2>
<p>The interview started with a comparison between Dialogflow CX and ES. CX is not just an incremental improvement over ES; it is in fact a complete redesign, with a more powerful and more intuitive dialog model. It also has a clean separation between intents and dialogue that greatly increases intent reusability and dialogue manageability, as well as a visual builder that Conversational Architects can easily use to create complex dialogues with less code.</p>
<p>According to Passemard, this had long been requested by many customers. While Dialogflow ES, which Google Cloud will continue to support and improve, is appropriate for simple dialogues, Dialogflow CX should be the platform of choice for longer and more complex dialogues. In addition, Dialogflow CX provides several advantages over ES:</p>
<ul>
<li>More predictable (although not necessarily lower) pricing</li>
<li>Several IVR features (including barge-in, DTMF support, timeouts, and retries), which are necessary to build conversational IVR</li>
<li>Support for up to 40,000 intents (compared to 2,000 with ES)</li>
<li>More collaboration features that enable teams to work on large projects more efficiently</li>
<li>Better support for analytics, experiments, and feedback loops</li>
<li>A better NLU engine, based on the latest BERT model</li>
</ul>
<p>Anybody can use Dialogflow today. However, for conversational IVR, integrating Dialogflow with a contact centre platform generally remains a challenge. Most IVR-specific features require a good integration with the IVR platform and depend on events or parameters being provided to Dialogflow, whether to leverage DTMF for use cases other than numerical parameters or to use incremental no-input event handlers.</p>
<p>Passemard mentioned that some solutions, such as AudioCodes, can facilitate this integration. Interestingly, he also mentioned that it is best to stream the audio directly to Dialogflow rather than using Google STT to transcribe the audio and send the transcription to Dialogflow. The reason is that Dialogflow has an <a href="https://cloud.google.com/dialogflow/cx/docs/concept/speech-adaptation">Auto Speech Adaptation</a> feature that automatically optimizes transcription accuracy based on the agent’s training phrases. That said, our own experience shows that we can often achieve equally good or better results by streaming the audio directly to Google STT, using <a href="https://cloud.google.com/speech-to-text/docs/speech-adaptation">speech adaptation</a>. Moreover, it is often necessary to post-process transcription results to make them compatible with Dialogflow’s NLU, which is not possible when streaming audio directly to Dialogflow.</p>
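<p>A minimal sketch of the alternative pipeline we describe: transcribe with Google STT, post-process the transcript, then send text to Dialogflow CX. The post-processing rule, identifiers, and session handling are illustrative assumptions:</p>
<pre><code># Illustrative sketch: STT, then post-processing, then Dialogflow CX text input.
# Assumes the google-cloud-speech and google-cloud-dialogflow-cx libraries.
from google.cloud import speech
from google.cloud import dialogflowcx_v3 as cx

def transcribe(audio_uri: str) -> str:
    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=8000,
        language_code="en-US",
    )
    audio = speech.RecognitionAudio(uri=audio_uri)
    response = client.recognize(config=config, audio=audio)
    return " ".join(r.alternatives[0].transcript for r in response.results)

def post_process(transcript: str) -> str:
    # Hypothetical normalization: undo an unwanted STT rewrite before
    # the text reaches the NLU (the rule itself is just an example).
    return transcript.replace("rr 1", "RR1")

def detect_intent(text: str, session_path: str):
    # session_path: projects/PROJECT/locations/LOCATION/agents/AGENT/sessions/SESSION
    client = cx.SessionsClient()
    request = cx.DetectIntentRequest(
        session=session_path,
        query_input=cx.QueryInput(
            text=cx.TextInput(text=text),
            language_code="en-US",
        ),
    )
    return client.detect_intent(request=request).query_result

query_result = detect_intent(
    post_process(transcribe("gs://my-bucket/utterance.wav")),  # hypothetical URI
    "projects/my-project/locations/global/agents/my-agent/sessions/abc123",
)</code></pre>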
<h2>Agent Assist for Voice</h2>
<p>The next topic covered in the interview was Agent Assist. This is an important topic for at least two reasons. The first is that there are very promising use cases for Agent Assist. The second is that we’ve heard a lot about CCAI Agent Assist in the past couple of years, but it’s been hard to understand exactly how to access this capability. About this last point, Passemard confirmed what we suspected: there is no public API for Agent Assist voice; Google decided to only make it available through CCAI telephony partners. As mentioned by Simms, this could be a smart business strategy. By working aggressively with telephony partners to integrate Agent Assist with their platforms and reselling only through them, Google may ensure that it becomes the de facto choice for Agent Assist.</p>
<p>The downside, however, is that enterprises are entirely dependent on the contact center vendors’ motivation and ability to make CCAI available to their customer base. It might be a while before many enterprises can leverage CCAI and, when that happens, it might require very expensive upgrades to their contact center infrastructure. For this reason, customers may end up turning to the alternative solutions that will inevitably become available.</p>
<p>This brings me to the Agent Assist use cases. Passemard mentioned that proposing relevant documents to agents based on the conversation wasn’t found to be very useful by customers. Agents don’t want to read through full documents to find the answer to a customer’s question. They want extractive search, which can automatically extract the relevant portion of a document. And, we heard, that’s coming soon. What is really taking off at the moment, according to Passemard, is the ability to automatically fill in forms in real time with information provided by the caller. That’s really powerful. And, of course, a side benefit of Agent Assist is getting a transcription of every single call.</p>
<h2>Agent Assist for Chat</h2>
<p>Passemard said that Agent Assist for chat has been shown to provide great improvements in agent productivity, agent satisfaction, and CSAT scores. In particular, the Smart Reply and Smart Compose capabilities are provided using predictive models trained on the customer’s data, which makes them much more accurate. Agent Assist for chat is currently only available from chat vendors, but a public API is coming out soon.</p>
<h2>Insights</h2>
<p>The last CCAI capability mentioned is Insights, which is Google’s name for speech analytics. Insights is still in preview, but the good news is that it will be available to all with a public API. Insights is about understanding conversations that are happening in the contact center. Using Insights, enterprises will be able to look at conversations, index them, search through them, do topic modeling and sentiment analysis, navigate within a conversation, and perform NLU-based phrase matching (e.g., “Give me all conversations with a greeting”). Google will support a SIPREC integration.</p>
<h2>Final Notes</h2>
<p>Passemard mentioned that Conversational AI is probably the first application of AI that has a massive impact on customers. That’s an intriguing claim; it would be interesting to see some data that supports it. He also concluded by strongly advising against underestimating the value of a good Conversational Architect. We couldn’t agree more. It’s definitely not something you learn in two weeks. The very good ones have years of experience, and they are critical to the success of any conversational project.</p><p>The post <a href="https://www.nuecho.com/takeaways-from-the-vux-world-live-google-contact-centre-ai-with-antony-passemard/">Takeaways from the VUX World Live Google Contact Centre AI with Antony Passemard</a> first appeared on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Nuance Partner eXperience Summit Review: An accelerated transformation in a fluid market</title>
		<link>https://www.nuecho.com/nuance-mix-partner-experience-summit-conversationnal-ai-speech-to-text/#utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=nuance-mix-partner-experience-summit-conversationnal-ai-speech-to-text</link>
		
		<dc:creator><![CDATA[Yves Normandin]]></dc:creator>
		<pubDate>Mon, 02 Mar 2020 21:04:12 +0000</pubDate>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Content]]></category>
		<category><![CDATA[IVR]]></category>
		<guid isPermaLink="false">https://zux.zsm.mybluehost.me/majoctobre2019/?p=6124</guid>

					<description><![CDATA[<p>Since Mark Benjamin joined Nuance as its new CEO almost two years ago, the company has been going through a breathtaking transformation. After selling its imaging division to Kofax and spinning off its automotive division, the company now focuses primarily on its core business of providing conversational AI products and solutions. Even in its core [&#8230;]</p>
<p>The post <a href="https://www.nuecho.com/nuance-mix-partner-experience-summit-conversationnal-ai-speech-to-text/">Nuance Partner eXperience Summit Review: An accelerated transformation in a fluid market</a> first appeared on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
<p>The post <a href="https://www.nuecho.com/nuance-mix-partner-experience-summit-conversationnal-ai-speech-to-text/">Nuance Partner eXperience Summit Review: An accelerated transformation in a fluid market</a> appeared first on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><span style="font-weight: 400;">Since Mark Benjamin joined Nuance as its new CEO almost two years ago, the company has been going through a breathtaking transformation. After selling its imaging division to Kofax and spinning off its automotive division, the company now focuses primarily on its core business of providing conversational AI products and solutions.</span></p>
<p><span style="font-weight: 400;">Even in its core conversational AI business, <a target="_blank" href="https://www.nuance.com" rel="noopener noreferrer">Nuance</a> is fast transforming itself from a software company with a very large focus on professional services to a product and platform company. This was evident at the 2019 Partner eXperience Summit, but it was even more so at this year’s event.</span></p>
<p><span style="font-weight: 400;">This change was of course necessary and, I might add, a bit overdue. Long the dominant vendor of enterprise speech technology solutions, Nuance is now being challenged by companies – <a target="_blank" href="https://about.google" rel="noopener noreferrer">Google</a>, <a target="_blank" href="https://www.amazon.ca" rel="noopener noreferrer">Amazon</a>, <a target="_blank" href="https://www.microsoft.com/en-ca/" rel="noopener">Microsoft</a>, and <a target="_blank" href="https://www.ibm.com/ca-en" rel="noopener noreferrer">IBM </a>among others – that offer easy to use conversational AI platforms with state-of-the-art technologies. With these platforms, the claim is that anybody can now develop sophisticated conversational AI solutions; that speech recognition (ASR) and natural language understanding (NLU) work “out-of-the-box” without any need for speech scientists; and that, in fact, you don’t even need developers to build solutions. This is the “do-it-yourself” (DIY) message and it is a compelling one.</span></p>
<p><span style="font-weight: 400;">Of course, that message is highly misleading. Yes, to some extent, the technology now works “out-of-the-box” in the sense that it is possible to get a simple conversational demo bot up-and-running quickly. With speech-to-text (STT) engines, there is no need to write speech recognition grammars and NLU engines can be trained with a few training phrases per intent. But that’s only good for a demo. Building an effective, enterprise-grade conversational AI system is hard work, no matter what the platform is (more on that in a future blog post).</span></p>
<p><span style="font-weight: 400;">What is true, though, is that enterprises really are looking for DIY tools. And they are increasingly demanding cloud-native solutions. And, above all, they want flexibility. And Nuance has heard that message loud and clear. They now understand that it’s no longer sufficient to have best-in-class technology and a good professional services organization. Customers want to have flexible development and deployment models.</span></p>
<p>The most recent big steps that Nuance has taken in that direction are:</p>
<ul>
<li>Conversational AI APIs (launched November 2019);</li>
<li>The Nuance Gatekeeper cloud-based security and biometrics suite (launched October 2019);</li>
<li>Nuance Mix: DIY tooling for partners and end users (general availability planned for end of March).</li>
</ul>
<p><span style="font-weight: 400;">The introduction of Nuance Mix, in particular, is a big change for a company that is used to directly delivering most of its conversational AI solutions through its professional services organization, using closely guarded development tools. But what we’ve seen so far of Mix is promising, with a slick, contemporary user interface. From a company that has years of experience building and deploying compelling conversational AI solutions, this is quite encouraging.</span></p>
<p><span style="font-weight: 400;">Nuance is facing powerful new competitors, but it has many advantages. Its technology is top-notch, it has a very large installed base, it offers the most flexible deployment models (premise or cloud), its technology is integrated with most contact center platforms, and it understands better than anybody what it takes to deliver conversational experiences that work not just in demos, but in the real world. Nuance also offers the most extensive capabilities to adapt and optimize the technology for a specific domain and a specific dialog state, which is often what makes the difference between a good demo and an enterprise-grade solution.</span></p>
<p><span style="font-weight: 400;">Another Nuance differentiator – which they position as a key element of their value proposition – is its strong professional services organization. But that could also turn out to be its Achilles&#8217; heel, because customers no longer want to be dependent on the vendor’s PS; they want to know that there is a large pool of people that are skilled on the technology and have all the tools necessary. It will be a challenge to change a company that is culturally used to delivering all the big projects into one that enables its partners and customers to do it themselves.</span></p>
<p><span style="font-weight: 400;">In conclusion, Nuance is clearly going in the right direction and making all the right moves, but its plan is ambitious, so execution will be key. Perhaps the biggest challenge will be to implement the culture changes that are required in order to successfully implement this transformation.</span></p>
<p><span style="font-weight: 400;">We’ve been in this market for close to 20 years and these are by far the most interesting times we’ve seen. We’re expecting quite a ride in the next few years.</span></p><p>The post <a href="https://www.nuecho.com/nuance-mix-partner-experience-summit-conversationnal-ai-speech-to-text/">Nuance Partner eXperience Summit Review: An accelerated transformation in a fluid market</a> first appeared on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p><p>The post <a href="https://www.nuecho.com/nuance-mix-partner-experience-summit-conversationnal-ai-speech-to-text/">Nuance Partner eXperience Summit Review: An accelerated transformation in a fluid market</a> appeared first on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Question answering experiments with the Dialogflow FAQ Knowledge Connectors</title>
		<link>https://www.nuecho.com/question-answering-experiments-with-the-dialogflow-faq-knowledge-connectors/#utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=question-answering-experiments-with-the-dialogflow-faq-knowledge-connectors</link>
		
		<dc:creator><![CDATA[Yves Normandin]]></dc:creator>
		<pubDate>Wed, 15 Jan 2020 13:00:57 +0000</pubDate>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[IVA]]></category>
		<guid isPermaLink="false">https://zux.zsm.mybluehost.me/majoctobre2019/?p=5478</guid>

					<description><![CDATA[<p>Chatbots come in multiple forms and can serve many different purposes. Without pretending to exhaustivity, we can mention the task-oriented bots, that aim to assist a user in a given set of transactional tasks, like, for example, banking operations the chit-chat bots, whose primary objective is to mimic casual conversation and the question answering bots, [&#8230;]</p>
<p>The post <a href="https://www.nuecho.com/question-answering-experiments-with-the-dialogflow-faq-knowledge-connectors/">Question answering experiments with the Dialogflow FAQ Knowledge Connectors</a> first appeared on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
<p>The post <a href="https://www.nuecho.com/question-answering-experiments-with-the-dialogflow-faq-knowledge-connectors/">Question answering experiments with the Dialogflow FAQ Knowledge Connectors</a> appeared first on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Chatbots come in multiple forms and can serve many different purposes. Without claiming to be exhaustive, we can mention</p>
<ul>
<li>the <em>task-oriented bots</em>, which aim to assist a user in a given set of transactional tasks, such as banking operations</li>
<li>the <em>chit-chat bots</em>, whose primary objective is to mimic casual conversation</li>
<li>and the <em>question answering bots</em>, whose purpose is to, you guessed it, answer users’ questions.</li>
</ul>
<p>These categories are not mutually exclusive: a task-oriented bot can support some level of small talk, and question answering bots can assist the user in some tasks. They should be seen as paradigms rather than strict definitions.</p>
<p>In this article, we will focus on the concept of the question answering chatbot, and more specifically on the implementation of this concept in Dialogflow, using <a target="_blank" href="https://cloud.google.com/dialogflow/docs/knowledge-connectors" rel="noopener">Knowledge connectors</a> (still a beta feature at the time of writing).</p>
<h2>About Dialogflow FAQ knowledge connectors</h2>
<p>Knowledge connectors are meant to complement the intents of an agent and offer a quick and easy way to integrate existing knowledge bases to a chatbot. Dialogflow offers two types of knowledge connectors: FAQ and Knowledge Base Articles. Here we will mostly focus on the FAQ knowledge connector, which models the knowledge bases as a list of question-answer pairs (QA pairs).</p>
<p>Each QA pair in a FAQ knowledge connector can be seen as a special kind of intent that has a single training phrase and a single text response. At first sight, the main advantages of a FAQ knowledge connector over defined intents seem to be the ease of integrating external knowledge bases and the fact that, contrary to defined intents, more than a single response can be returned (which can be convenient for a search mode).</p>
<p>Are there any other advantages? One of our hypotheses when we started this work was that knowledge connectors would be able to leverage the answer in the QA pair when matching the query, not just the question. This is not explicitly mentioned in the documentation, but it would make sense for two reasons. First, it’s hard to believe that any NLU engine can effectively learn from a single training phrase. There are always many ways to ask a question that don’t look at all like the training phrase. Second, FAQ data sources often have long answers that could conceivably be correct answers to a wide range of questions other than the one provided. When trying to find the correct answer to a user query, it would therefore make sense for the engine to focus as much on finding the answer that best answers the query as on finding the question that best matches the query.</p>
<h2>Anatomy of the Knowledge Base</h2>
<p>The knowledge base we used was taken from the Frequently Asked Questions (FAQ) section of the website of a North American airport. It contains more than a hundred QA pairs, separated into a dozen categories, each containing from one to about ten subcategories.</p>
<p>While some questions have straightforward answers, others have complex, multi-paragraph ones. All the answers are primarily composed of text, but many also contain tables or images, and some even contain videos. Many answers also have hyperlinks leading to other parts of the FAQ or to external pages.</p>
<h2>Minor surgery on the Knowledge Base</h2>
<p>While analyzing the knowledge base, we found that several questions only made sense within the context of the category and sub-category in which they appear. For instance, in the Parking section, we have the question “<em>How do I reserve online?</em>”. The FAQ context makes it clear that this is a question about parking reservation, but this information is lost when modeling the knowledge base as a CSV-formatted list of question-answer pairs (QA pairs). We therefore had to modify several of the original questions so that they could be understood without the help of any context. So, in the example above, the question was changed to: “<em>How do I reserve a parking space online?</em>”.</p>
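<p>To illustrate, after this rewording the knowledge base row becomes self-contained. The two-column question/answer CSV layout below is a simplified sketch of the connector’s input format, and the answer text is invented for illustration:</p>
<pre><code>question,answer
"How do I reserve a parking space online?","Go to the Parking section of our website, select your dates, and pay by credit card to confirm your reservation."</code></pre>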
<h2>What questions users ask</h2>
<p>The airport website offers users two distinct ways to type queries to get answers: one that clearly looks like a search bar, and another that looks like a chat widget and pops up when the user clicks a “Support” button at the bottom right of the web page. Both do exactly the same thing: they perform a search in the knowledge base and return links to the most relevant articles. However, we believe that the chat-like interface entices more complex, natural queries, since users may believe they are entering a chat conversation.</p>
<p>The airport provided us with a set of real user queries collected from the two query interfaces. This is very important because it tells us what questions users are really asking, and it provided us with real user data for our experiments.</p>
<p>Of course, we had to do some cleaning on that data set, as a good number of queries were not relevant for our purpose: digit strings (most likely phone numbers and extensions), flight numbers with no other indication, or purely phatic sentences (for example, “<em>how are you?</em>”). We also observed that the queries fell into two groups: either they were really short and to the point, with one or two words at most, or they were long and complex, with lots of information and details, and usually formulated as a question.</p>
<h2>Augmenting the corpus</h2>
<p>Once the data set was cleaned, we ended up with about 300 queries (down from a little more than 1500!). Clearly, this would not be sufficient for our experiments, so we decided to collect additional data that, we hoped, would still be representative of real user queries.</p>
<p>We considered using crowdsourcing solutions (like <a target="_blank" href="https://www.mturk.com" rel="noopener">Amazon Mechanical Turk</a>) but ultimately decided to try other options. Instead, we used the <em>People also ask and Related searches</em> functionalities of Google Search to glean additional user data. We would start with a user query (real or fabricated) and collect the related questions proposed by Google. One interesting feature of the <em>People also ask</em> functionality is that every time we expand one of the choices, it proposes several additional related questions. This way, we ended up collecting about 300 additional queries with little to no effort, effectively doubling the number of queries we had.</p>
<p>At the same time, we also organized an internal data collection at Nu Echo, where our colleagues wrote plausible user queries based on general categories that we assigned to them. This gave us over 400 additional queries, bringing our total to about a thousand.</p>
<h2>Annotating the corpus</h2>
<p>Annotating the corpus consists of manually determining which QA pair in the knowledge base, if any, correctly answers each of the queries in the corpus. While this sounds simple, it proved to be a surprisingly difficult task. The human annotator has to carefully analyze each potential answer before deciding whether or not it’s a correct response to the query. For some queries, there was no correct answer, but there were one or several QA pairs that provided relevant answers.</p>
<p>What we ended up doing was separate the corpus in 3 categories:</p>
<ol>
<li>Queries with a correct answer (<em>an exact match</em>);</li>
<li>Queries without an exact match but with one or several relevant answers (<em>relevant matches</em>);</li>
<li>Queries without any match at all.</li>
</ol>
<p>Queries in the second category would be labeled with all relevant QA pairs. When we finished annotating, only 33% of the queries had an exact match, even if 91% of the corpus can be considered “in-domain”. An interesting observation is that the FAQ coverage varied significantly based on the source of the queries, as shown in the table below.</p>
<table style="height: 165px;" width="379px">
<tbody>
<tr>
<td><b>Source</b></td>
<td><b>Count</b></td>
<td><b>Exact match</b></td>
<td><b>Coverage</b></td>
</tr>
<tr>
<td>Google</td>
<td>275</td>
<td>133</td>
<td>48.36%</td>
</tr>
<tr>
<td>Website queries</td>
<td>303</td>
<td>63</td>
<td>20.79%</td>
</tr>
<tr>
<td>Nu Echo</td>
<td>440</td>
<td>150</td>
<td>34.09%</td>
</tr>
<tr>
<td><b>Total</b></td>
<td><b>1018</b></td>
<td><b>346</b></td>
<td><b>33.99%</b></td>
</tr>
</tbody>
</table>
<p>Our explanation is that the Google queries tended to be simpler and more representative of real user queries, the website queries were often out-of-domain, incomplete or ambiguous. The Nu Echo queries tended to be overly “creative” and generally less realistic.<span style="font-size: 18px;"> </span></p>
<h2>Train and test set</h2>
<p>We split our corpus into a train set and a test set. The queries in the train set are used to improve accuracy while the test set is used to measure accuracy. Note that this is a very small test set. It contains 407 queries, of which only 151 have an exact match (37%). It is also very skewed: The top 10% most frequent FAQ pairs account for 61% of those 151 queries.<span style="font-size: 18px;"> </span></p>
<h2>Performance metrics</h2>
<p>To measure performance, we need to decide which performance metrics to use. We opted for <a target="_blank" href="https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall" rel="noopener">precision and recall</a> as our main metrics. They are defined as follows:</p>
<ul>
<li><em>Precision</em>: of all the predictions returned by Dialogflow, how many of them are actually correct?</li>
<li><em>Recall</em>: of all the actual responses we’d like to get, how many of them were actually predicted by Dialogflow?</li>
</ul>
<p>In our case, we considered only exact matches and the top prediction returned by Dialogflow. One reason for this is that relevant matches are fairly subjective and we have found that the agreement between annotators tends to be low. Another reason is that this makes comparison with other techniques (e.g., using defined intents) easier since these techniques may only return one prediction.</p>
<p>Since Dialogflow returns a confidence score that ranges from 0 to 1 for each prediction it makes, we can control the precision-recall tradeoff by changing the confidence threshold. For example:</p>
<ul>
<li>when the threshold is at 0, we accept all predictions, and the recall is at its highest, while the precision is usually at its lowest;</li>
<li>when the threshold is at 1, we exclude almost all predictions, so the recall will be at its lowest, but the precision usually is the highest.</li>
</ul>
<p>When shown graphically, this provides a very useful visualization that makes it easy to quickly evaluate the performance of an agent against a given set of queries, or to compare agents (see results below).</p>
<p>We’re now ready to delve into some of the experiments we performed. Note that the data that has been used to perform these experiments are publicly available in a <a target="_blank" href="https://github.com/nuecho/experiment-data/tree/master/2019-12-13_dialogflow-faq-knowledge-connectors" rel="noopener">Nu Echo GitHub repository.</a><span style="font-size: 18px;"> </span></p>
<h2>Experiments with the FAQ Knowledge connector</h2>
<p>We took all of the QA pairs we extracted from the airport knowledge base and pushed those to a Dialogflow Knowledge Base FAQ connector. Then we trained an agent and tested this agent with the queries in the test set. Here’s the result.</p>
<p><img decoding="async" class="wp-image-5483 size-full aligncenter" src="https://www.nuecho.com/wp-content/uploads//2019/12/1.jpg" alt="" width="870" height="493" /></p>
<p>Ouch! This curve shows, at best, a recall of barely 40%. And that’s with less than 30% precision. Something is definitely wrong here. A first analysis of the results reveals something very interesting: The question in the QA pair that correctly answers the user query is often very different from the query. For instance, the correct answer to the query “<em>Can I bring milk with me on the plane for the baby</em>?” is actually found in the QA pair with the following question: “<em>What are the procedures at the security checkpoint when traveling with children?</em>”. In other words, those two formulations are too far apart for any NLU engine to make the connection. In order to identify the correct QA pair, one really has to analyze the answer in order to determine whether it answers the query.</p>
<p>Unfortunately, Dialogflow seems to mostly rely on the question in the QA pair when predicting the best QA pairs and that creates an issue: The more information there is in a FAQ answer, the more difficult it is to reduce it to a single question.<span style="font-size: 18px;"> </span></p>
<h2>What if QA pairs could have multiple questions?</h2>
<p>Contrary to defined intents, Dialogflow FAQ knowledge connectors are limited to a single question per QA pair. While this makes sense if the goal is to use existing FAQ knowledge bases “as is”, it may limit the achievable question answering performance. But what if we work around that restriction by including multiple copies of the same QA pair, but using different question formulations (different questions, same answer)? This could allow us to capture different formulations of the same question, as well as entirely different questions for which the answer is correct.</p>
<p>Here is how we did it:</p>
<ul>
<li>We selected the top 10 most frequent QA pairs in the corpus. For each of them, we created several new QA pairs containing the same answer, but a different question (using questions from the train set). We called this the expanded FAQ set.</li>
<li>We created a new agent trained with this expanded set of QA pairs.</li>
<li>We tested this new agent on the test set.</li>
</ul>
<p>The graph below compares the performance of this new agent with the original one. There is a definite improvement in recall, but precision still remains very low.</p>
<p><img decoding="async" class="wp-image-5485 size-full aligncenter" src="https://www.nuecho.com/wp-content/uploads//2019/12/2.jpg" alt="" width="887" height="533" /></p>
<h2>FAQ vs Intents</h2>
<p>How do defined intents compare with Knowledge Base FAQ? To find out, we created an agent with one intent per FAQ pair. For each intent, the set of training phrases included the original question in the QA pair, plus all the queries in the train set labelled with that QA pair as an exact match. Then we tested this new agent on the test set. The graph below compares this new result with the previous two results.</p>
<p><img decoding="async" class="wp-image-5487 size-full aligncenter" src="https://www.nuecho.com/wp-content/uploads//2019/12/3.jpg" alt="" width="897" height="522" /></p>
<p>That is an amazing jump in performance. Granted, these are not great results, but at least we know we are heading in the right direction and that performance could still be improved a lot.<span style="font-size: 18px;"> </span></p>
<h2>A quick look at Knowledge Base Articles</h2>
<p>As mentioned before, Dialogflow offers two types of knowledge connectors: FAQ and Knowledge Base Articles. Knowledge Base Articles are based on the technologies used by Google Search, which look for answers to questions by reading and understanding entire documents and extracting a portion of a document that contains the answer to the question. This is often referred to as <a target="_blank" href="https://ai.googleblog.com/2019/01/natural-questions-new-corpus-and.html" rel="noopener">open-domain question answering</a>.</p>
<p>We wanted to see how this would perform on our FAQ knowledge base. To get the best possible results, we reviewed and edited the FAQ answers to make sure we followed the <a target="_blank" href="https://cloud.google.com/dialogflow/docs/knowledge-connectors#supported-content" rel="noopener">best practices</a> recommended by Google. This includes avoiding single-sentence paragraphs, converting tables and lists into well-formed sentences, and removing extraneous content. We also made sure that each answer was completely self-contained and could be understood without knowing its FAQ category and sub-category. Finally, whenever necessary, we added text to make it clear what question was being answered. The edited FAQ answers are provided in the <a target="_blank" href="https://github.com/nuecho/experiment-data/tree/master/2019-12-13_dialogflow-faq-knowledge-connectors" rel="nofollow noopener noreferrer">Nu Echo GitHub repository</a>.</p>
<p>The result is shown below (green curve, bottom left). What this shows is that Knowledge Base Articles just doesn’t work for that particular knowledge base. The question is: why?</p>
<p><img decoding="async" class="wp-image-5489 size-full aligncenter" src="https://www.nuecho.com/wp-content/uploads//2019/12/4.jpg" alt="" width="874" height="543" /></p>
<p>Although further investigation is required, a quick analysis immediately revealed one issue: Some frequent QA pairs don’t actually contain the answer to the user query, but instead provide a link to a document containing the desired information. This may explain why, in those cases, the Article Knowledge Connector couldn’t match the answer to the query.<span style="font-size: 18px;"> </span></p>
<h2>Conclusion</h2>
<p>We wanted to see whether it was possible to achieve good question answering performance by relying solely on Dialogflow Knowledge Connector with existing FAQ knowledge bases. The answer is most likely “no”. Why? There are a number of reasons:</p>
<ul>
<li>While defined intents can have as many training phrases as we want, FAQ knowledge bases are limited to a single question per QA pair. This turns out to be a significant problem since it is difficult to effectively generalize from a single example. That’s especially true for QA pairs with long answers, which can correctly answer a wide range of very different questions.</li>
<li>FAQ knowledge bases are often not representative of real user queries and, therefore, their coverage tends to be low. Moreover, they often need a lot of manual cleanup, which means that we cannot assume that the system will be able to automatically take advantage of an updated FAQ knowledge base.</li>
<li>Many user queries require a structured representation of the query (i.e., with both intents and entities) and a structured knowledge base to be able to produce the required answer. For instance, to answer the question “<em>Are there any restaurants serving vegan meals near gate 79?</em>”, we need a knowledge base containing all restaurants, their location, and the foods they serve, as well as an algorithm capable of calculating a distance between two locations.</li>
<li>Many real frequent user queries require access to back-end transactional systems (e.g., “<em>What is the arrival time of flight UA789?</em>”). Again, this cannot be implemented with a static FAQ knowledge base.</li>
</ul>
<p>The approach we recommend for building a question answering system with Dialogflow is consistent with what Google actually recommends, that is, use Knowledge Connectors to complement defined intents. More specifically, use the power of defined intents, leveraging entities and lots of training phrases, to achieve a high success level on the really frequent questions (the short tail).</p>
<p>Then, for those long tail questions that cannot be answered this way, use knowledge connectors with whatever knowledge bases are available to propose possible answers that the user will hopefully find relevant.</p>
<p><span style="font-weight: 400;">Thanks to Guillaume Voisine and Mathieu Bergeron for doing much of the experimental work and for their invaluable help writing this blog.</span></p>
<div class='et-box et-shadow'>
					<div class='et-box-content'><h2 style="text-align: center;"><span style="color: #333333;">Conversational automation initiatives</span></h2>
<div class="et_slidecontent et_shortcode_slide_active" style="display: block; text-align: center;"><span style="color: #333333;">Take our <a target="_blank" href="https://fr.surveymonkey.com/r/TXVZFK5" rel="noopener">survey</a> on innovative technologies on the customer experience, and find out the results from the dataset. (All data collected will remain anonymous.)</span></div></div></div><p>The post <a href="https://www.nuecho.com/question-answering-experiments-with-the-dialogflow-faq-knowledge-connectors/">Question answering experiments with the Dialogflow FAQ Knowledge Connectors</a> first appeared on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p><p>The post <a href="https://www.nuecho.com/question-answering-experiments-with-the-dialogflow-faq-knowledge-connectors/">Question answering experiments with the Dialogflow FAQ Knowledge Connectors</a> appeared first on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Chatbots, Voicebots, IVA, IVR: Sorting through the confusion</title>
		<link>https://www.nuecho.com/chatbots-voicebots-iva-ivr/#utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=chatbots-voicebots-iva-ivr</link>
		
		<dc:creator><![CDATA[Yves Normandin]]></dc:creator>
		<pubDate>Wed, 11 Dec 2019 16:21:25 +0000</pubDate>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[IVA]]></category>
		<category><![CDATA[IVR]]></category>
		<guid isPermaLink="false">https://zux.zsm.mybluehost.me/majoctobre2019/?p=5577</guid>

					<description><![CDATA[<p>​In the past few years, we have witnessed the introduction of a bunch of new terms and expressions related to conversational systems and interfaces: chatbots, voicebots, intelligent virtual agents (IVAs), intelligent virtual assistants (IVAs), etc. Unfortunately, all of these tend to mean different things to different people, which ends up generating a lot of confusion [&#8230;]</p>
<p>The post <a href="https://www.nuecho.com/chatbots-voicebots-iva-ivr/">Chatbots, Voicebots, IVA, IVR: Sorting through the confusion</a> first appeared on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
<p>The post <a href="https://www.nuecho.com/chatbots-voicebots-iva-ivr/">Chatbots, Voicebots, IVA, IVR: Sorting through the confusion</a> appeared first on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>​In the past few years, we have witnessed the introduction of a bunch of new terms and expressions related to conversational systems and interfaces: chatbots, voicebots, intelligent virtual agents (IVAs), intelligent virtual assistants (IVAs), etc. Unfortunately, all of these tend to mean different things to different people, which ends up generating a lot of confusion in the industry.</p>
<p>In an attempt to, if not eliminate, at least reduce some of that confusion, I’ll propose some broad definitions for these terms.</p>
<p>A <strong>chatbot</strong> is an automated system with which users interact through a “chat-like” interface. This includes messaging channels such as Messenger, WhatsApp, Slack&#8230; but it also includes SMS, iMessage, as well as other chat-like interfaces such as web chats, chat widgets in mobile applications, etc. Although chatbot interactions should primarily be done through text input and output, they in practice increasingly incorporate rich media (depending on what the channel supports) such as buttons, images, carousels, webviews, etc. In reality, many chatbots have little or no support for text input, relying primarily on buttons for user input. A chatbot is not necessarily conversational (see <a target="_blank" href="https://www.nuecho.com/news-events/what-do-you-mean-conversational-ivr/" rel="noopener">here</a> for an explanation of what we mean by conversational) and in fact most chatbots are highly directed, menu driven “dialogs”.</p>
<p>A <strong>voicebot</strong> is a chatbot with which users can interact vocally. This assumes that the chatbot behind the voicebot can handle natural language input and it requires a capability to convert voice input into text (or directly into intents), as well as text output into voice. Example voicebots include any bots accessible through a voice channel, which include the now ubiquitous smart home speakers, but also the plain old telephone channel as well as any VoIP channel, for instance the call channels of Skype, Messenger, WhatsApp, Slack, etc. In that sense, a conversational IVR could be seen as a voicebot. Another example would be a Dialogflow voicebot, accessible through any voice channel, that takes advantage of Dialogflow’s ability to <a target="_blank" href="https://cloud.google.com/dialogflow-enterprise/docs/detect-intent-audio" rel="noopener">detect intent from audio</a>.</p>
<p>An <strong>Intelligent Virtual Agent (IVA)</strong> is a robot that simulates an agent (which, in this context, really means a contact center agent). It provides some of the services normally provided by a contact center agent through a communication with users – via voice or text channels – that resembles human-to-human communication. For reference, DMG defines an IVA as “<em>A system that utilizes artificial intelligence, machine learning, advanced speech technologies (including NLU/NLP/NLG) to simulate live and unstructured cognitive conversations for voice, text, or digital interactions via a digital persona.” A virtual agent can hence be a chatbot, a voicebot, or both.</em></p>
<p>An <strong>Intelligent Virtual Assistant</strong> (also IVA, unfortunately) is a system that is dedicated to helping its user, either by providing useful information or advice (weather or traffic information, financial advice, etc.), by answering questions, or by accomplishing tasks on his/her behalf (e.g., planning meetings, booking hotels, paying bills, whatever). Interaction with an intelligent virtual assistant is often done through text or voice conversational channels, which effectively makes it a chatbot or a voicebot, but it can also be done through mobile or web applications.</p>
<p>An<strong> IVR (Interactive Voice Response)</strong> is an interactive telephone system that is primarily used in a call center to steer calls to the appropriate agent, and possibly to enable callers to perform some self-service transactions. Most IVR systems today are anything but conversational, relying instead primarily on menu navigation through DTMF (touch-tone) user inputs. Several IVR systems also enable speech input, but most of these only support voice menus and directed dialogs. More recently, natural language call steering applications, which enable callers to state the purpose of their call in their own words, have gained in popularity, but that remains a very small minority of IVR systems out there. The surge in popularity of conversational systems, however, is inevitably now impacting IVR, so expect to see a rapidly increasing number of<strong> IVR voicebots</strong> being deployed in the near future.</p><p>The post <a href="https://www.nuecho.com/chatbots-voicebots-iva-ivr/">Chatbots, Voicebots, IVA, IVR: Sorting through the confusion</a> first appeared on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p><p>The post <a href="https://www.nuecho.com/chatbots-voicebots-iva-ivr/">Chatbots, Voicebots, IVA, IVR: Sorting through the confusion</a> appeared first on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Does your FAQ stand for Fail to Answer Questions?</title>
		<link>https://www.nuecho.com/faq-chatbot-answering/#utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=faq-chatbot-answering</link>
		
		<dc:creator><![CDATA[Yves Normandin]]></dc:creator>
		<pubDate>Mon, 25 Nov 2019 22:00:55 +0000</pubDate>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[IVA]]></category>
		<guid isPermaLink="false">https://zux.zsm.mybluehost.me/majoctobre2019/?p=5399</guid>

					<description><![CDATA[<p>From FAQs to chatbots: Improve customer experience with conversational question answering. A significant portion of customer service inquiries is about users wanting an answer to a question. Organizations are rightfully motivated to provide efficient means for users to find answers to their questions autonomously (i.e, without interacting with a human agent) since it can improve [&#8230;]</p>
<p>The post <a href="https://www.nuecho.com/faq-chatbot-answering/">Does your FAQ stand for Fail to Answer Questions?</a> first appeared on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
<p>The post <a href="https://www.nuecho.com/faq-chatbot-answering/">Does your FAQ stand for Fail to Answer Questions?</a> appeared first on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
]]></description>
										<content:encoded><![CDATA[<h3>From FAQs to chatbots: Improve customer experience with conversational question answering.</h3>
<p>A significant portion of customer service inquiries is about users wanting an answer to a question. Organizations are rightfully motivated to provide efficient means for users to find answers to their questions autonomously (i.e, without interacting with a human agent) since it can improve user experience while greatly reducing costs by freeing up valuable time for their customer service agents.</p>
<p>To achieve this, organizations traditionally propose a Frequently Asked Questions (FAQ) section on their website and they also often provide a search capability that can return relevant articles from a knowledge base, the website, or both. In many cases, these can provide fairly effective means for users to get the information they’re looking for, therefore reducing pressure on the contact center.</p>
<p>In that context, how can a question-answering customer service chatbot add value? Certainly, that cannot be by providing a chat-like interface to a static FAQ or to an existing website search capability. That just wouldn’t be very compelling (for a discussion on this topic, see Tobias Goebel’s great blog post explaining why <a target="_blank" href="https://chatbotsmagazine.com/why-you-cant-just-convert-faqs-into-a-chatbot-1-1-92205141d008" rel="noopener">we can’t just convert FAQs into a chatbot 1:1</a>).</p>
<h2>Chatbot question answering: beyond static FAQs and search</h2>
<p>In order to really provide question answering value, a customer service chatbot has to go beyond the FAQ capabilities already provided on the website. This can be achieved in a number of ways, including by:</p>
<ol>
<li>Directly answering user’s questions rather than providing links to relevant documents. If I ask “Are strollers allowed on airplanes?” I’d like to have a clear response (“Yes, strollers are allowed.”) rather than list of articles that may or may not answer my question.</li>
<li>Truly leveraging a conversational interface, for instance by enabling the chatbot to clarify vague questions:</li>
</ol>
<p style="padding-left: 60px;">User: I’m looking for a telephone number<br />
Chatbot: Who would you like to call?<br />
User: Lost items<br />
Chatbot: The lost-and-found telephone number is 123-456-7890</p>
<p style="padding-left: 30px;">Or by enabling users to ask follow-on questions:</p>
<p style="padding-left: 60px;">User: Can I bring breast milk on a plane?<br />
Chatbot: Yes, breast milk is allowed on airplanes.<br />
User: What about strollers?<br />
Chatbot: Strollers are also allowed.</p>
<ol start="3">
<li>Providing dynamic and/or personalized answers, which require access to back-end systems. For instance:</li>
</ol>
<p style="padding-left: 60px;">What is the arrival time for flight United 285?<br />
When should I expect to receive my luggage?</p>
<ol start="4">
<li>Enabling question answering at any time during the course of a chatbot conversation.</li>
<li>Giving users the ability to continue the conversation with a human agent, if the chatbot isn’t able to solve the user’s issue.</li>
</ol>
<p>In a chatbot, the very frequent queries (the short tail) can &#8211; and should &#8211; be handled using standard approaches (e.g., with intents and entities). While that requires work to maintain the chatbot to handle those new frequent queries that will inevitably occur, it’s the approach that will provide the best results.</p>
<p>Meanwhile, however, there will always be all those long tail queries that would just require too much effort to try to support that way. So when the chatbot doesn’t have the answer to a question, it is best to fall back to a search-like mode that can automatically leverage all those documents and knowledge bases that you already have. They most likely contain answers to many of these questions. This not only reduces development effort, but it makes it much easier to keep the system up to date with the latest answers.</p>
<h2>Search-like capabilities in conversational platforms</h2>
<p>Some conversational platforms provide search-like capabilities that make it possible to automatically leverage existing knowledge bases or documents to search for answers to those user queries that the chatbot cannot answer. For instance:</p>
<ul>
<li>Chatbots developed with <a target="_blank" href="https://www.ibm.com/cloud/watson-assistant/" rel="noopener">Watson Assistant</a> can leverage <a target="_blank" href="https://www.ibm.com/cloud/watson-discovery" rel="noopener">Watson Discovery</a> for that purpose. Performance can be improved by using <a target="_blank" href="https://www.ibm.com/watson/services/knowledge-studio/" rel="noopener">Watson Knowledge Studio</a> to teach Watson about the language and relationships that are useful in order to understand your specific domain or industry.</li>
<li>Chatbots developed with Google <a target="_blank" href="https://dialogflow.com/" rel="noopener">Dialogflow</a> can leverage Dialogflow’s <a target="_blank" href="https://cloud.google.com/dialogflow/docs/knowledge-connectors" rel="noopener">Knowledge Connectors</a> to search knowledge bases for a response to a user query. Knowledge connectors are offered in two varieties: FAQs and knowledge base articles. FAQs are used to integrate existing Frequently Asked Questions (e.g., from a website). In that case, finding a response means finding the FAQ question-answer pairs (QA pairs) that best match the user query. With knowledge based articles, Dialogflow actually looks for the answer to user queries within the articles and returns the most relevant portion of the article as answer.</li>
</ul>
<p>In future blog posts, we will report on experiments with some of these platforms. Stay tuned.</p><p>The post <a href="https://www.nuecho.com/faq-chatbot-answering/">Does your FAQ stand for Fail to Answer Questions?</a> first appeared on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p><p>The post <a href="https://www.nuecho.com/faq-chatbot-answering/">Does your FAQ stand for Fail to Answer Questions?</a> appeared first on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>SpeechTEK 2019 Review: Conversational AI is now all about the telephone channel</title>
		<link>https://www.nuecho.com/speechtek-2019-review/#utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=speechtek-2019-review</link>
		
		<dc:creator><![CDATA[Yves Normandin]]></dc:creator>
		<pubDate>Fri, 03 May 2019 01:39:52 +0000</pubDate>
				<category><![CDATA[Blog]]></category>
		<guid isPermaLink="false">https://zux.zsm.mybluehost.me/?p=3082</guid>

					<description><![CDATA[<p>Conversational AI was clearly one of the biggest themes this year at SpeechTEK (Apr 29 &#8211; May 1, 2019, Washington DC). And SpeechTEK being a speech technology conference, the emphasis was naturally on voice, rather than text conversations. Conversational AI is first and foremost about Intelligent Virtual Agents/Assistants (IVAs), which are robots that provide services [&#8230;]</p>
<p>The post <a href="https://www.nuecho.com/speechtek-2019-review/">SpeechTEK 2019 Review: Conversational AI is now all about the telephone channel</a> first appeared on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
<p>The post <a href="https://www.nuecho.com/speechtek-2019-review/">SpeechTEK 2019 Review: Conversational AI is now all about the telephone channel</a> appeared first on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Conversational AI was clearly one of the biggest themes this year at SpeechTEK (Apr 29 &#8211; May 1, 2019, Washington DC). And SpeechTEK being a speech technology conference, the emphasis was naturally on voice, rather than text conversations. Conversational AI is first and foremost about Intelligent Virtual Agents/Assistants (IVAs), which are robots that provide services through a user interface that simulates human-to-human communication.</p>
<p>Of course, there is nothing really new about all this. Chatbots have been all the rage for over three years. What was really new this year at SpeechTEK was this unmistakable feeling that conversational AI over the phone had suddenly become mainstream. Clues were everywhere, starting with the strong participation of Google Cloud, Twilio, and Gridspace, all Diamond Sponsors at the conference. Both Twilio and Google Cloud had keynote sessions, with conversational AI as the main topic. But the strongest hints of all came from casual conversations with attendees, which were by and large seeing telephone IVAs as inevitable and were looking for the best solutions to turn this into a reality.</p>
<h3><strong>Telephone calls are not going away</strong></h3>
<p>Maybe this has to do with the fact that call volumes into contact centers aren’t going down but customer expectations are going up as a result of their experience with personal assistants like Siri and Google Assistant, as well as smart speakers like Amazon Echo and Google Home. In that context, companies cannot continue to ignore their old IVR system that is increasingly becoming the worst portion of their customer’s journey with them.</p>
<p>The challenge is how to provide that great conversational experience over the telephone. In the chatbot world, the great majority of chatbot developers have long ago realized that it’s hard to make sure that natural language understanding (NLU) technology works well enough to provide a great user experience and have mostly resorted to adopting very directed menu-based interactions, limiting the use of NLU to where it is absolutely necessary. And, by and large, that works quite well for many use cases. But in the IVR world, that’s not an option because directed, menu-based interactions are what most companies offer today and users simply hate it.</p>
<h3><strong>Accuracy is key&#8230;</strong></h3>
<p>In their talks, Google Cloud rightly insisted on the importance of speech-to-text (STT) and NLU accuracy. Last year, Google Cloud introduced enhanced acoustic models for the telephone, which cut error rates in half and I’m sure they will continue to improve accuracy. I also expect STT vendors to eventually introduce features that will enable developers to further improve accuracy by being able to tell the engine what types of responses are most likely. Even if, in principle, users can say anything, the odds are high that if the bot is asking a question, the user will respond to that question rather than say something totally unrelated. Being able to give indications to the STT engine about what users are likely to say can make a huge difference in accuracy. Speech technology vendors like Nuance have known that forever and they make sure that developers have this kind of control.</p>
<h3><strong>&#8230;but conversational user experience design is critical</strong></h3>
<p>Another, but related topic that was discussed at length at SpeechTEK is the importance of conversational user experience (CUX) design expertise and the lack of such expertise in the market. This has been an issue for chatbots, but to a certain extent companies have been able to work around it by leveraging rich media features available on messaging channels. On the telephone, where the interaction is primarily through a voice conversation, CUX expertise is critical. Simply managing a conversation that is natural and productive to the user requires strong CUX skills. But the challenge is even greater in the context where the bot always has to deal with uncertain STT and NLU inputs and therefore has to use efficient repair dialogue to deal with this uncertainty. There was, by the way, a very interesting presentation on this topic at the conference by Bruce Balentine.</p>
<h3><strong>Beyond IVAs</strong></h3>
<p>Finally, beyond IVAs, there was also a lot of discussion at SpeechTEK on other dimensions of conversational AI, namely, agent assistants and speech analytics. Agent assistants provide real-time guidance to agents based on the analysis of the ongoing conversation between the agent and the customer. This was one of the topics discussed by Google Cloud during their keynote (part of their Contact Center AI offering), but other vendors also presented solutions in that space, namely ttec Associate Assist and Gridspace Relay.</p>
<p>So, all in all, a very interesting conference, from my perspective. Let me know if you detected other important trends at the conference.</p><p>The post <a href="https://www.nuecho.com/speechtek-2019-review/">SpeechTEK 2019 Review: Conversational AI is now all about the telephone channel</a> first appeared on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p><p>The post <a href="https://www.nuecho.com/speechtek-2019-review/">SpeechTEK 2019 Review: Conversational AI is now all about the telephone channel</a> appeared first on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
