And that’s nothing new, really. We used to call that “speech recognition IVR” and we’ve been delivering these conversational experiences for 20 years.
What is new is that there are now novel technologies and platforms that promise to make it much faster and easier to create these conversational experiences while greatly expanding the range of tasks that virtual voice agents (that’s what we call them) can handle.
These novel technologies initially emerged in the context of voice assistants (Siri, Amazon Echo, Google Home) and are in the process of fundamentally changing the way IVR solutions are developed.
To understand how, let’s compare the “Traditional speech recognition IVR” with the “New IVR”.
- Speech recognition: grammars and statistical language models → deep learning speech-to-text (STT)
- Natural language understanding (NLU): grammars and simple classifiers → deep learning NLP
- Speech output: prompt concatenation + some text-to-speech → deep learning text-to-speech
Let’s review the above in greater detail.
Traditional IVR speech recognition
The speech recognition engines traditionally used in speech IVR (e.g., Nuance Recognizer) can’t recognize speech “out-of-the-box”. To recognize speech, they need a speech recognition grammar. There are two main types of grammars:
- SRGS grammars are defined by a set of rules, hand-crafted by a grammar developer, which provide a formal definition of the language that can be recognized by the engine. The language defined by SRGS grammars is rigid and only the sentences included in that language can be recognized by the engine. This makes them well suited for directed dialogues, which tend to have a predictable range of user utterances.
- Statistical language models (SLMs) are defined by N-grams: probabilities of a word given the preceding words in the sentence, learned from sample sentences. SLMs provide a much less rigid language model than SRGS grammars, so they are better suited for handling spontaneous natural language responses to open-ended prompts (e.g., “How may I help you?”). To perform well, SLMs need a sufficiently large and representative corpus of sentences with which to train the model.
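To make the N-gram idea concrete, here is a minimal sketch of bigram estimation from a tiny corpus (a real SLM would use higher-order N-grams, smoothing, and far more training data):

```python
from collections import Counter

def train_bigram_model(sentences):
    """Estimate P(word | previous word) from sample sentences."""
    bigrams = Counter()
    unigrams = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()  # <s> marks sentence start
        for prev, word in zip(words, words[1:]):
            bigrams[(prev, word)] += 1
            unigrams[prev] += 1
    # Maximum-likelihood estimate: count(prev, word) / count(prev)
    return {pair: count / unigrams[pair[0]] for pair, count in bigrams.items()}

# Toy corpus of responses to "How may I help you?"
corpus = [
    "I want to check my balance",
    "I want to pay my bill",
    "check my balance please",
]
model = train_bigram_model(corpus)
print(model[("my", "balance")])  # "balance" follows "my" in 2 of 3 cases
```

The probabilities generalize beyond the exact training sentences, which is precisely why SLMs are less rigid than rule-based grammars.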
Developing a traditional speech IVR application typically requires creating a separate grammar for each step in the dialogue. Moreover, to achieve sufficient recognition accuracy, these grammars need to be extensively tuned based on real user utterances collected by the application in production.
Developing and tuning these grammars are time-consuming tasks that require highly skilled speech scientists. Done well, this produces very high accuracy and good user experiences. Unfortunately, the temptation to cut corners here is very high, which inevitably results in poor performance and user experience. This is one of the main reasons why speech recognition IVR has such a bad reputation.
In the past several years, we have witnessed breakthrough improvements in speech recognition technology thanks to deep learning. This has made it possible to train speech-to-text (STT) engines that produce high accuracy speech transcription with almost unlimited vocabularies. Nowadays, STT engines are available from a wide range of vendors (e.g., Google STT, Nuance Krypton, Amazon Transcribe, Deepgram, etc.) and there are even open-source versions available.
With STT engines, there is no need to develop grammars at all, so this is a huge time saver when creating conversational IVR applications. This is not to say that speech recognition is a solved problem, far from it. Accuracy remains very much an issue. In fact, we can often achieve significantly better accuracy with well-tuned grammars than with even the best STT engines.
At the moment, the main issues with STT engines are:
- Training data. As with any model based on machine learning, an STT model performs best when its training data is representative of the conditions in which it is used. For instance, a model mostly trained on recordings from smart speakers at home, primarily involving topics like music playback, weather information, alarm setting, and general trivia questions, may not be optimal for a banking IVR application. The ability to fine-tune an STT model on domain-specific data could clearly make a huge difference in accuracy. Unfortunately, most commercial STT vendors don’t make that possible (one notable exception being Deepgram). Nuance does provide a partial solution by making it possible to train a Domain Language Model (DLM) on phrases specific to the target domain.
- Contextualization. STT engines can conceptually recognize any user utterance, whether it’s about movies, politics, weather, music, or whatever. That’s very powerful but that’s also a liability in conversational applications, which are usually both domain-specific and highly contextual. If the virtual agent asks a user for a birthdate, then there’s a fairly good chance that the user will respond with a birthdate. The ability to take advantage of such contextual knowledge can greatly improve speech recognition accuracy. Humans do this all the time without even realizing it. Some STT engines do provide some contextualization capabilities (e.g., Google STT model adaptation), but these remain quite limited at the moment.
- Optimization. Traditional IVR speech recognition engines provide several effective ways to optimize accuracy. For example, big accuracy gains can be achieved by fine-tuning phonetic transcriptions, modeling intra- and inter-word coarticulation, modeling disfluencies, tuning grammar weights, post-processing N-best results, etc. Most STT engines provide few, if any, means to optimize accuracy.
- Multilingual support. Nu Echo is based in bilingual Montreal and most conversational applications we deploy need to support English words in French sentences and vice-versa (address recognition is a very good example). That can only be done effectively with a speech recognition engine capable of supporting two languages in the same utterance, a feature available in some traditional IVR speech recognition engines, but in no STT engine we know of.
STT technologies are evolving extremely rapidly so we can expect continuously improving accuracy, more effective contextualization and optimization tools, as well as better access to domain-optimized models. In the meantime, the optimal solution may very well be a combination of STT and traditional IVR engines.
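One way such a combination could work is sketched below: run a constrained grammar-based recognition in parallel with an open STT request, and prefer the grammar result when it is confident. The result type, field names, and threshold are hypothetical, not any vendor’s API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RecognitionResult:
    text: str
    confidence: float  # normalized to 0.0 - 1.0

def pick_result(grammar_result: Optional[RecognitionResult],
                stt_result: Optional[RecognitionResult],
                grammar_threshold: float = 0.8) -> Optional[RecognitionResult]:
    """Prefer the (more accurate, but rigid) grammar match when it is
    confident; otherwise fall back to the open-vocabulary STT transcript."""
    if grammar_result and grammar_result.confidence >= grammar_threshold:
        return grammar_result
    return stt_result

# Example: a date grammar matched with high confidence wins over STT.
grammar = RecognitionResult("july fourth nineteen eighty", 0.92)
stt = RecognitionResult("july 4th 1980 I think", 0.71)
print(pick_result(grammar, stt).text)  # -> "july fourth nineteen eighty"
```

In a directed-dialogue step with a predictable answer space, the grammar usually wins; on open-ended prompts, the grammar fails to match and the STT transcript flows through.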
Natural language understanding (NLU)
Early speech IVR applications relied exclusively on SRGS grammars for speech recognition, so NLU was not an issue since NLU is built into the grammar.
The use of statistical language models (SLMs) created the need for a separate NLU engine, capable of understanding free-form speech recognition results. Intent detection techniques based on simple machine learning techniques were introduced more than 20 years ago for the purpose of natural language call routing. These techniques have worked quite well, but they typically require a large number of sample sentences per intent to adequately train the model, which is often a big obstacle to getting a system up and running.
For a very long time, these techniques didn’t evolve much. Then, deep learning totally changed the landscape for natural language processing technologies. A first impact has been the introduction of word embeddings, which improve generalizability and make it possible to greatly reduce the number of sample sentences required to train NLU models. More recently, large language models (e.g., BERT) and new neural network architectures are providing further improvements.
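To illustrate why word embeddings reduce the number of required sample sentences, here is a toy sketch with hand-made 2-D vectors (real systems use pretrained embeddings with hundreds of dimensions): an unseen wording like “settle invoice” still lands close to an intent trained only on “pay bill”.

```python
import math

# Toy 2-D "embeddings": related words sit close together in vector space.
# These vectors are invented for illustration only.
EMBEDDINGS = {
    "pay":     (0.90, 0.10),
    "settle":  (0.85, 0.20),   # near-synonym of "pay"
    "bill":    (0.10, 0.90),
    "invoice": (0.15, 0.85),   # near-synonym of "bill"
}

def sentence_vector(sentence):
    """Average the word vectors of the known words in the sentence."""
    vecs = [EMBEDDINGS[w] for w in sentence.split() if w in EMBEDDINGS]
    return tuple(sum(component) / len(vecs) for component in zip(*vecs))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# One training sample for the intent is enough for an unseen wording to match:
intent_sample = sentence_vector("pay bill")
print(cosine(intent_sample, sentence_vector("settle invoice")))  # close to 1.0
```

With one-hot bag-of-words features, “settle invoice” shares zero words with “pay bill” and would score zero; embeddings are what buy this generalization.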
Note that, although the same NLU technologies are used for both text and voice conversations, there are important differences. For instance, while text conversational systems must be able to robustly deal with typos, initialisms (e.g., “lol”), emoticons, etc., voice conversational systems have to deal with homophone spelling differences (e.g., “coming” vs “cumming”, “forestcrest” vs “forest crest”, or “our our 9” vs “rr9”), undesired normalizations by the STT engine (e.g., “rr1 third concession” → “rr⅓ concession”) and, of course, speech recognition errors.
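A common mitigation is to post-process STT transcripts before sending them to the NLU engine. The rewrite rules below are purely illustrative, keyed to the examples above:

```python
import re

# Illustrative rewrite rules for transcripts headed to the NLU engine:
# undo undesired STT normalizations and collapse known homophone variants.
REWRITES = [
    (re.compile(r"\brr⅓\b"), "rr1 third"),        # undo "rr1 third" -> "rr⅓"
    (re.compile(r"\bour our (\d+)\b"), r"rr\1"),  # "our our 9" -> "rr9"
    (re.compile(r"\bforestcrest\b"), "forest crest"),
]

def normalize_transcript(text: str) -> str:
    """Apply domain-specific rewrites to an STT transcript before NLU."""
    text = text.lower()
    for pattern, replacement in REWRITES:
        text = pattern.sub(replacement, text)
    return text

print(normalize_transcript("Our our 9 near Forestcrest"))
# -> "rr9 near forest crest"
```

In practice such rule sets are built from tuning data, exactly the kind of STT post-processing that some platform integrations unfortunately make impossible.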
Some issues with NLU engines include:
- Contextualization. Most NLU engines are not contextual (one exception being Dialogflow), which can be a problem since the same utterance can have different interpretations depending on the context. For instance, the meaning of “Montreal” is different depending on whether the question was “what’s your destination?” or “what’s the departure city?”
- Confidence scoring. Effective repair dialogue requires dependable confidence scores and, unfortunately, NLU confidence scores tend not to be very good. Moreover, NLU scores usually don’t take the speech recognition confidence score into account, which is a big problem: how can we be confident in an NLU result if it’s based on a low-confidence speech recognition result? In voice conversational applications, effective confidence scores need to take both the STT and the NLU scores into account.
- N-best results. Many NLU engines only return the best scoring intent, even when several intents have almost identical scores. Having access to N-best results makes it possible to make better dialogue decisions (e.g., disambiguation) or to choose the best hypothesis based on contextual information not available to the NLU engine.
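One simple way to combine the two scores is sketched below; the geometric mean and the thresholds are illustrative choices, not a standard:

```python
import math

def combined_confidence(stt_conf: float, nlu_conf: float) -> float:
    """Geometric mean: low confidence on either side drags the result down."""
    return math.sqrt(stt_conf * nlu_conf)

def next_dialogue_move(stt_conf: float, nlu_conf: float,
                       accept: float = 0.75, confirm: float = 0.45) -> str:
    """Pick a repair strategy from the combined confidence score."""
    score = combined_confidence(stt_conf, nlu_conf)
    if score >= accept:
        return "accept"
    if score >= confirm:
        return "confirm"   # e.g., "Did you say ...?"
    return "reprompt"      # ask the question again

# A high NLU score alone is not enough if recognition itself was shaky:
print(next_dialogue_move(stt_conf=0.30, nlu_conf=0.95))  # -> "confirm"
```

The exact combination function and thresholds would normally be tuned on real call data, but even this crude version avoids blindly accepting an intent built on a poor transcript.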
Natural language processing is currently one of the most active areas of research in artificial intelligence and we expect to see a continuous stream of technological advances making their way into conversational AI systems.
Text-to-speech (TTS)

Text-to-speech (TTS) engines have been around for a very long time, but until recently, their quality and intelligibility weren’t nearly good enough to provide a good conversational experience. The best speech IVR applications therefore relied almost exclusively on prompts recorded in the studio by professional voice talents. Speech generation for sentences incorporating dynamic data was done with prompt concatenation, which is quite difficult to do well.
But we’ve recently seen such phenomenal progress in TTS technologies that it now makes sense to use TTS instead of studio recordings in most cases. That’s especially true in English, where the quality of the best TTS is such that it’s sometimes difficult to distinguish it from human speech. Moreover, it is now possible to create custom TTS voices that imitate the voice of our favorite voice talent.
The use of TTS technology really is a game changer when it comes to creating and evolving conversational IVR applications. It eliminates the need to go back to the studio to record new prompts every time an application change is required, and it avoids all the cumbersome, error-prone manipulation of thousands of voice segments (often in multiple languages). Applications can now be modified, tested, and released to production almost on the fly.
Of course, TTS is not perfect and we still see the occasional glitch, but generally that seems like a small price to pay for the immense added value it provides. The best solution may very well be a combination of recorded audio for those key prompts where we want exactly the intonation and emotion we’re looking for, with a custom TTS voice built from the same voice talent used in the recorded prompts.
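That hybrid strategy can be sketched as a simple prompt resolver; the directory layout and prompt IDs below are hypothetical:

```python
from pathlib import Path

# Hypothetical layout: one studio-recorded file per key prompt ID.
RECORDINGS_DIR = Path("prompts/recorded")

def resolve_prompt(prompt_id: str, text: str):
    """Play the studio recording when one exists for this prompt ID;
    otherwise fall back to synthesizing the text with the custom TTS
    voice built from the same voice talent."""
    audio_file = RECORDINGS_DIR / f"{prompt_id}.wav"
    if audio_file.exists():
        return ("audio", audio_file)
    return ("tts", text)

# A key greeting would use its recording; ad-hoc dynamic sentences
# (new wording, fresh data) fall back to TTS with no studio round-trip.
print(resolve_prompt("confirm_transfer", "You are transferring $250 to Alex."))
```

Because the fallback is automatic, an application change only requires editing prompt text, with recordings added later for the prompts that warrant them.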
Integration with contact center platforms
Traditional speech IVR applications have for a long time relied on mature and time-tested standards for integrating conversational technologies. This includes MRCP for speech recognition and text-to-speech, VoiceXML for dialogue, SRGS for speech recognition grammars, and SISR for semantic interpretation.
Now, with the emergence of a new generation of cloud contact center platforms and the arrival of the latest deep learning-based technologies, all of these are being thrown out the window and replaced with a variety of proprietary APIs and some emerging standards (e.g., gRPC).
What this means is that the integration of these new conversational technologies with contact center platforms remains very much a work in progress, so we find that:
- Support for some basic capabilities that we used to take for granted (e.g., barge-in, DTMF support) is not always where it needs to be
- The choice of available conversational technologies on many CC platforms remains limited
- Even when integrations are available, they often make it very difficult to fully take advantage of the technology’s full potential (e.g., no access to some confidence scores or N-best lists, inability to post-process STT results before sending them to the NLU engine, etc.)
Some solutions are emerging to fill this integration gap. For instance, AudioCodes’ VoiceAI Connect claims to provide “easy connectivity between any CC platform and any bot frameworks or speech engine”. This could make it possible to leverage the conversational technologies that best fit the requirements of any given solution.
The best of both worlds
Deep learning is fundamentally impacting conversational AI technologies, and this is profoundly changing the way we conceive the development of IVR applications. We are still very early in that transformation. These novel technologies are still fairly immature and are likely to evolve rapidly in the near future, as is our understanding of how to most effectively leverage them.
Nonetheless, they are already providing some very concrete and transformative benefits. For instance:
- It is no longer required to create complex grammars or to collect thousands of SLM training utterances to get speech recognition to work. The best speech-to-text engines provide “good enough” speech recognition accuracy out-of-the-box so it is now possible to have a system up and running quickly.
- The latest NLU engines can be trained with an order of magnitude fewer sample sentences than older NLU classification technologies, which also makes it possible to get a first version of a system up and running very quickly.
- The latest text-to-speech technologies are getting so good that recorded prompts are often no longer necessary (especially in English). This is a game changer: it greatly shortens the time required to create and deliver a new version of the application, greatly facilitating and accelerating the deployment of enhancements.
The ability to quickly get a first application version up and running is key since it makes it possible to quickly start collecting real conversational data and utterances, which are the raw material with which the system can be continuously enhanced and optimized.
While some of the limitations of STT technologies are still being addressed (contextualization, optimization, multilingual support, etc.), conversational IVR application developers should consider combining STT with traditional IVR speech recognition technologies to get the best of both worlds and deliver exceptional conversational IVR user experiences (some IVR platforms, such as the Genesys Voice Platform, make this possible).