A couple of weeks ago, I had the opportunity to attend Bot Week in San Francisco. In addition to the main events (Rasa Summit and Chatbot Conference), I attended every event of the week to make the most of my stay in this innovative city, and oh! it was worth it. Not only did I learn a lot, but I also met many interesting and interested people, important players in the bot (voice and chat) ecosystem, and heard about exciting use cases and technologies. This full immersion gave me a renewed perspective on what has been done in this area and what is left to explore, and I will try, through this blog post, to give you a glimpse of what I learned.
Like an evening star, leading innovators on the path to the new era of chatbots and voicebots, is a Vision, a Vision slowly leaving sci-fi movies to enter our reality: the omnipotent personal virtual assistant (OPVA). Imagine having your own OPVA. Or let’s call it Jarvis, like Iron Man’s. Imagine having your own Jarvis (iron suit sold separately). Jarvis is with you everywhere; Jarvis is your own personal vocal Google search; Jarvis starts your coffee pot 10 minutes before you wake up; Jarvis even reschedules your dentist appointment behind your back, because you unknowingly booked your camping trip the same week. This is the Vision.
Multiple speakers talked about the Vision, and/or the path to it. This path is generally represented as 5 levels of AI assistants. For more information, you can read Rasa’s CEO Alex Weidauer’s take on the 5 levels from an enterprise point of view, or for a summary, these equivalences with some of Jarvis’s skills:
- Notification Assistant: Does not support user input, only sends messages
Jarvis: The temperature outside is 1,000 °C; this might become dangerous for your suit.
- FAQ Assistant: One-step interactions, answers generic questions:
Tony: What’s iron’s melting point?
J: 1,538 °C
- Contextual Assistant: Answers contextual questions if context is explicitly given:
T: Can you send a message to Pepper?
J: Sure. What is the message?
T: “I will be late for dinner due to some complications, love you.”
J: Got it.
- Personalized Assistant: Knows the user, their preferences, has, or appears to have, some form of understanding of the user’s world:
T: Can you notify my wife I might be late due to some complications?
J: Sure, I will let Pepper know you will not be with her for dinner as expected.
- Autonomously Organized Assistants: Services are interconnected and user does not need to intervene:
J: Your blood pressure is dropping. May I suggest you head to the nearest hospital?
T: I’m okay, I just need to…
J: Sir? I didn’t understand. (pause) Your vital signs indicate you might have lost consciousness, I will bring you to the hospital if you do not explicitly cancel.
J: Starts auto-pilot to the nearest hospital, notifies Pepper and also notifies the hospital of the incoming patient.
Current Jarvis or Where Are Bots Now?
I remember, a couple years ago, all these “Build a Bot in 10 Minutes” blogs and tutorials, and how every dialogue engine was sold as the easiest and fastest way to create a chatbot. Many were trying to sell their own cheap version of this fashionable new toy.
I was more than happy to find that no one sells this idea anymore. The ideal chatbot has shifted from easily built to personalized, efficient and conversational, as attested by the hype around Erica (which, as a Canadian, I had not really heard about before the conference). Bank of America’s (large) team spent months working on it, and is still tuning it and enriching its vocabulary and skills. Pretty far from one person building a chatbot while making a deposit… Not only is it accepted that a bot needs a significant amount of thought and work beforehand, but also that it needs attention afterwards, using analytics and new user data for continuous improvement. Thus, the market has evolved, and lots of new companies have emerged in the last couple of years, offering tools and expertise to facilitate this continuous work.
Here are those who stood out the most by their strong presence during the week:
- Design tools: BotMock and BotSociety
- Area-specific building tools: Smartloop for leads and sales
- Chatbot platforms: Rasa (of course)
- Analytics: BotAnalytics and DashBot
N.B.: For a more exhaustive list, refer to the agendas of the events.
Special mention – Robocopy: The emergence of conversational bots in the last few years gave birth to the Conversation Designer job title. Many of those who wear this title are former UI designers, copywriters or linguists, and until now, the related knowledge was scattered across the guidelines of bot design tools and blog posts. I think the arrival of Robocopy’s Conversational Academy marks a milestone in this field; it is becoming an area of expertise in itself, more and more defined every day. I can’t judge the quality of their courses based only on the fascinating talk of their co-founder, Hans Van Damm, but putting this knowledge together can only be a push in the right direction.
On the Conversational Aspect
But to create a bot, technology needs to support the design. According to Alex Weidauer, technology has made it possible to build efficient question-answering bots (level 2) for a few years now (still not a ten-minute job though: training the natural language understanding (NLU) model and handling exceptions seamlessly demands work), and it now allows level 3 bots, i.e. contextual assistants. The next step would be achieving level 4 (another special mention goes to Aigo, who seem to have accomplished it for the daily tasks of a home assistant).
Upcoming Jarvis or What’s Coming Up Next?
The first talk at Chatbot Conference was Sean Badge from Google on Rich Communication Services (RCS), an overdue rich-content protocol that is slowly replacing SMS. It is a step towards integrated enterprise assistants, allowing them to connect with the user on one network, without forcing them to install separate apps.
5G and Edge Computing
At Mobile Monday’s Future of Voice and Smart Speakers, discussions revolved around how cloud computing slows down assistants and how network latency prevents voicebot conversations from feeling natural. Imagine talking to one of your friends on the phone, and each time you stop talking, there’s a one-second silence before they answer normally. You would wonder if your friend was one of the first victims of a robot takeover. In the same way, when virtual assistants do this, it only reminds us that it is not a human on the line.
Edge computing, i.e. distributed computing close to where it is needed, is probably the solution to this annoying latency, and 5G, which is faster and lets more devices connect together, brings it closer than it ever was. Voicebots could eventually be more like that friend who starts talking before the end of your sentence because they can predict the last words. The polite version.
The Rasa Experience
As we are trying to make AI assistants more conversational and conversations more human-like, Rasa, as a dialogue engine, stands out as a promising technology for two reasons:
- The use of machine learning (ML) on the conversational level (and not only NLU)
- Their open-source codebase
We have been happily using Rasa for several months now, so the first advantage was already obvious to me: ML probably holds the key to machines acting like humans in a variety of contexts, since hard-coding every single reaction would be a colossal task, if not an impossible one. Consequently, as ML’s advocate in conversation management, Rasa has an edge its competitors do not. But it is only by attending the Rasa Summit that I could appreciate the advantages of the second point. A self-evident one is that open source means easy customization. It also means on-premise deployment, which is a plus for organizations managing sensitive user data like banks, insurance or health care providers, three of the biggest owners of customer service chatbots (at least in the USA). And last but not least, a refreshing community feel emanates from Rasa events, because they put a significant emphasis on community and value their contributors. They can retain people and enterprises, make them contribute joyfully, and bring in new ideas and technology, while aligning their product vision/roadmap with community requirements.
Working for a company that has been bringing “conversational” and IVR together for years, I could not ignore how voice channels were discussed at these conferences. They did have a significant, but not central, place in Bot Week, and it is logical: how odd and inefficient would Jarvis be, if only available by chat? The more bots become conversational, the less we can ignore that language starts with voice, and that for this same reason, voice assistant usage is rising.
It is generally accepted that designing a voicebot is different from designing a chatbot because of the limited content that can be sent back. However, I noticed that bot developers, myself included, tend to forget something important, a fact expressed simply by Emily Lonetto from VoiceFlow at Slack’s Building the Bots of the Future event: voice might be the easiest, fastest and most portable channel to ask for things, but it is often not the right one to receive them. Indeed, for a single piece of information, you would expect Jarvis to answer verbally, but for a full report, you would expect a whole interactive 3D hologram (equivalent to an email or pile of paper from a real human assistant).
I think that this idea of a distinct output channel tends to be left behind for two reasons:
- In some voice channels or for some users who do not have the appropriate device (an Echo Show with Alexa for example), a visual output might be impossible.
- The idea of designing one bot, with the same flow and the same NLU model for all channels, with only the need to adapt the response, is tempting. While most bot-building platforms are designed with this workflow in mind, this over-simplification makes it harder to send an output on a second channel.
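To make the idea of a distinct output channel more concrete, here is a minimal Python sketch of how a bot could decide where to send a response. Everything in it is an assumption made for illustration: the `Response` class, the `pick_output_channels` function, the channel names and the 30-word threshold are hypothetical and do not come from any particular bot platform.

```python
from dataclasses import dataclass

# Hypothetical channel names; a real platform would define its own.
VISUAL_CHANNELS = ("screen", "chat", "email")


@dataclass
class Response:
    text: str
    has_rich_content: bool = False  # e.g. tables, charts, a full report


def pick_output_channels(response, input_channel, available_channels,
                         max_spoken_words=30):
    """Return the channels the response should be sent on, in order."""
    short_enough = len(response.text.split()) <= max_spoken_words
    speakable = short_enough and not response.has_rich_content

    # Non-voice inputs, or answers short enough to be spoken, stay on the
    # channel the user used.
    if input_channel != "voice" or speakable:
        return [input_channel]

    # Otherwise, acknowledge by voice and send the full content to the
    # first visual channel the user has, if any.
    for channel in available_channels:
        if channel in VISUAL_CHANNELS:
            return ["voice", channel]

    # No visual channel available: fall back to voice only.
    return ["voice"]


balance = Response("Your balance is 1,200 dollars.")
report = Response("Monthly spending report", has_rich_content=True)

short_answer = pick_output_channels(balance, "voice", ["voice", "email"])
full_report = pick_output_channels(report, "voice", ["voice", "email"])
phone_only = pick_output_channels(report, "voice", ["voice"])
```

With this kind of rule, the dialogue flow and the NLU model can stay shared across channels, while the choice of output channel becomes an explicit step instead of being assumed to match the input.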
Another cause of this simplification is probably that voice assistants’ speech-to-text (STT) algorithms are unaware of the NLU model. Surprisingly, no one mentioned the problem with this approach, which seems unavoidable to me. I will illustrate it with a true example that happened to me a few months ago while testing a bot over voice with such a system.
Context: I was testing a banking app, and was asked if I wanted to make a recurring or a one-time payment and answered “one time”. I could see the intermediary STT results of my audio stream, and here’s what I got:
- One (I am not finished talking yet)
- One time (Cool it works)
- One time (It is waiting for me to say something else, I guess…)
- Fun time (Final result. Wait what?)
Obviously, my dialogue flow fell into an error state. The correct hypothesis was not chosen (and was even replaced!) because the speech recognition model was unaware of the kind of answer it should have been expecting. STT technology sure is getting better and better at eliminating noise, understanding accents and using the user’s history, their location or other contextual information, but user-specific information is not always available, e.g. in a phone call. Moreover, in this situation, the sound quality can be far below what a voice assistant gets because of many factors (low bandwidth, low resolution, microphone, etc.), which multiplies the risk of an incorrect transcription.
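Here is a minimal Python sketch of what making the recognizer aware of the expected answer could look like: re-ranking the STT hypotheses using the phrases the dialogue is currently waiting for. It is an illustration under stated assumptions, not how any specific STT engine works. The `rerank` function, the `bias` weight and the confidence scores are hypothetical (I did not record the real scores during my test); only `difflib.SequenceMatcher` comes from the Python standard library.

```python
from difflib import SequenceMatcher


def rerank(hypotheses, expected_phrases, bias=0.5):
    """Pick the best transcription given what the dialogue expects.

    hypotheses: list of (text, confidence) pairs from the STT engine.
    expected_phrases: answers the current dialogue step is waiting for.
    bias: how much weight the expected phrases get (hypothetical value).
    """
    def score(hypothesis):
        text, confidence = hypothesis
        # Similarity to the closest expected phrase, between 0.0 and 1.0.
        similarity = max(
            (SequenceMatcher(None, text.lower(), phrase.lower()).ratio()
             for phrase in expected_phrases),
            default=0.0,
        )
        return confidence + bias * similarity

    return max(hypotheses, key=score)[0]


# Made-up confidence scores for the "one time" example.
hypotheses = [("fun time", 0.62), ("one time", 0.58)]
expected = ["one time", "recurring"]

without_context = rerank(hypotheses, expected, bias=0.0)
with_context = rerank(hypotheses, expected)
```

Without any bias, the hypothesis with the highest raw confidence wins, which is what happened to me. With the expected phrases taken into account, the slightly less confident but much more plausible "one time" comes out on top.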
Maybe in an innovative town like San Francisco, people do not talk about an “aging” medium like telephony, but we work with IVR systems every day, and know that large call centers are still a reality for many organizations, and will continue to be for years to come. With cell phones being so omnipresent, the phone remains the easiest means of communication for urgent situations such as calling the insurance company after a car crash.
It turns out that in this IVR universe, for the aforementioned reasons, technologies like VoiceXML closed, and still close, the gap between speech recognition and NLU. They should not be overlooked, as they can be used to bring the newer chatbot technology to legacy call center installations (as we did with Rasa and the Rivr Bridge). Then one day, with technological advancements like Dialogflow’s Auto speech adaptation, speech recognition, visual recognition, language understanding and conversation management will all work hand in hand, in constant communication in Jarvis’s circuits, as happens in our own brains.