Welcome to the first installment of our series on the development of interactive voice response (IVR) applications using Rasa!
There has been a lot of buzz around Rasa recently (version 1.0 was released a few weeks ago, along with Rasa X). At Nu Echo, we have been quite busy since the beginning of the year comparing various alternatives for developing conversational IVR applications (including building our own in-house engine). We finally decided to go with Rasa since it offers a good balance between functionality, flexibility, extensibility and control over the solution, and it can easily be deployed both on-premises and in the cloud. Rasa is also open source and has a vibrant community. And let’s not forget about the possibility of creating dialogues using examples and machine learning (more on that later in the series)!
The lasting relevance of VXML
While it is definitely possible to develop IVR applications in 2019 without using VoiceXML (think Dialogflow or Amazon Lex through Amazon Connect), it is still very prevalent in large contact centers, which are an important part of our customer base. That’s why we decided to give Rasa a serious spin to find out if it could be a viable solution for developing multilingual conversational IVR applications using VoiceXML.
The use cases
For our proof of concept, we selected some banking use cases (account balance and pay bill) that exercise interesting dialogue patterns like digressions, confirmations, corrections and global commands (for example, cancelling a task).
Here is an example that mixes some of those patterns:
Introducing Rasa IVR
Developing VoiceXML IVR applications using Rasa offers interesting challenges. For one, the real-time aspect of a voice conversation, which must keep progressing even when the user says nothing, is quite different from the classic chatbot approach. The application can’t simply wait indefinitely: when the user stays silent, it must offer alternate, more detailed prompts and eventually terminate the conversation if the silence persists.
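This escalating behaviour maps naturally onto VoiceXML’s no-input handlers. Here is a minimal sketch (the form name, prompt wording and grammar file are illustrative, not taken from our actual application):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="main_menu">
    <field name="choice">
      <prompt>What would you like to do today?</prompt>
      <!-- First silence: short reprompt -->
      <noinput count="1">
        <prompt>Sorry, I didn't hear you. What would you like to do?</prompt>
      </noinput>
      <!-- Second silence: more detailed guidance -->
      <noinput count="2">
        <prompt>You can say "account balance" or "pay a bill".</prompt>
      </noinput>
      <!-- Third silence: give up gracefully -->
      <noinput count="3">
        <prompt>It seems you are no longer there. Goodbye.</prompt>
        <disconnect/>
      </noinput>
      <grammar src="main_menu.grxml" type="application/srgs+xml"/>
    </field>
  </form>
</vxml>
```

The `count` attribute lets the platform select progressively more helpful messages on successive timeouts, which is exactly the "alternate and more detailed messages" behaviour described above.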
While Rasa offers a lot of prebuilt channels, none of them can express the richness of VoiceXML or interpret its different outputs. As you can see from the complexity of its specification, a lot must be done to cover all the functionalities (although some are used less than others). Some of the most important ones for basic use cases are related to constructing the output (audio files / speech synthesis), activating bargein, specifying grammars / input mode (speech and/or DTMF), and configuring confidence levels / timeouts.
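To make that list concrete, here is a hypothetical field definition showing where each of those knobs lives in VoiceXML (the property values, file names and prompt text are placeholders):

```xml
<field name="amount">
  <!-- Accept both speech and touch-tone input -->
  <property name="inputmodes" value="dtmf voice"/>
  <!-- How long to wait for input before a noinput event -->
  <property name="timeout" value="5s"/>
  <!-- Recognition results below this confidence are rejected -->
  <property name="confidencelevel" value="0.5"/>
  <!-- Prerecorded audio with TTS fallback; bargein lets the
       caller interrupt the prompt -->
  <prompt bargein="true">
    <audio src="amount_prompt.wav">How much would you like to pay?</audio>
  </prompt>
  <grammar src="amount.grxml" type="application/srgs+xml"/>
</field>
```

A custom Rasa channel for VoiceXML has to carry enough information in each bot response to generate this kind of markup, not just a text message.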
The user experience (UX) must also be tailored for the voice channel since the user cannot scroll up to access the whole conversation (unless they have a supernatural memory). Some patterns like confirmation or choosing from a list are much trickier to properly implement using voice than text and widgets.
Along with those challenges come some interesting opportunities. For example, using automatic speech recognition (ASR) alongside contextual grammars allows us to greatly improve recognition accuracy by giving a greater weight to the most probable responses. VoiceXML also offers many functionalities for better integration with the contact center, which must be exposed (agent transfer, attached data, recordings). The synchronous nature of the conversation also simplifies the implementation, since the user can’t frantically send multiple (sometimes contradictory) inputs.
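The weighting idea can be expressed directly in an SRGS grammar via the `weight` attribute on alternatives. A sketch for a confirmation context (the weights and command words are illustrative assumptions, not tuned values from our application):

```xml
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         root="confirm" xml:lang="en-US" mode="voice">
  <rule id="confirm" scope="public">
    <one-of>
      <!-- In a confirmation context, yes/no are the most likely answers -->
      <item weight="2.0">yes</item>
      <item weight="2.0">no</item>
      <!-- Global commands stay active, but at a lower weight -->
      <item weight="0.5">cancel</item>
      <item weight="0.5">agent</item>
    </one-of>
  </rule>
</grammar>
```

Because the dialogue engine knows what question was just asked, it can hand the ASR a grammar biased toward the expected answers while still keeping global commands reachable.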
As I said earlier, this post is the first of a series that will cover different aspects of the making of our conversational banking application proof of concept. Stay tuned for more articles from my colleagues on our approach toward generating VoiceXML, dialogue management, cloud deployment and more!