<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Karine Dery - AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</title>
	<atom:link href="https://www.nuecho.com/author/kdery/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.nuecho.com/author/kdery/</link>
	<description>Nu Echo</description>
	<lastBuildDate>Thu, 15 Sep 2022 14:52:33 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://www.nuecho.com/wp-content/uploads/2019/11/cropped-favicon-32x32.png</url>
	<title>Karine Dery - AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</title>
	<link>https://www.nuecho.com/author/kdery/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Voice agents VS. Chatbots: Where does the difference lie?</title>
		<link>https://www.nuecho.com/voice-agents-vs-chatbots-where-does-the-difference-lie/#utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=voice-agents-vs-chatbots-where-does-the-difference-lie</link>
		
		<dc:creator><![CDATA[Karine Dery]]></dc:creator>
		<pubDate>Wed, 14 Sep 2022 16:27:02 +0000</pubDate>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Dialogflow]]></category>
		<category><![CDATA[Chatbot]]></category>
		<category><![CDATA[contact center automation]]></category>
		<category><![CDATA[contact center virtual agent]]></category>
		<category><![CDATA[Conversational AI]]></category>
		<category><![CDATA[Conversational design]]></category>
		<category><![CDATA[DialogFlow]]></category>
		<category><![CDATA[NLP model]]></category>
		<category><![CDATA[NLU model]]></category>
		<category><![CDATA[use cases virtual agents]]></category>
		<category><![CDATA[virtual agent]]></category>
		<category><![CDATA[Voicebot]]></category>
		<category><![CDATA[voicebot persona]]></category>
		<guid isPermaLink="false">https://www.nuecho.com/?p=9487</guid>

					<description><![CDATA[<p>In our field of work, we often hear “Once we’re done with the voice assistant, we’ll just use the dialog to add a chatbot on our website!” or “now that our chatbot is done, it will be a piece of cake to make a voice bot”. Seemingly, it looks like we would only need to [&#8230;]</p>
<p>The post <a href="https://www.nuecho.com/voice-agents-vs-chatbots-where-does-the-difference-lie/">Voice agents VS. Chatbots: Where does the difference lie?</a> first appeared on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><span style="font-weight: 400;">In our field of work, we often hear “Once we’re done with the voice assistant, we’ll just use the dialog to add a chatbot on our website!” or “now that our chatbot is done, it will be a piece of cake to make a voice bot”. Seemingly, it looks like we would only need to add or remove the speech processing (</span><i><span style="font-weight: 400;">speech-to-text</span></i><span style="font-weight: 400;">, STT) and speech synthesis (</span><i><span style="font-weight: 400;">text-to-speech</span></i><span style="font-weight: 400;">, TTS) layers to magically transform a chatbot into a voicebot and vice versa (by the wave of a magic wand).</span></p>
<p><span style="font-weight: 400;">Based on our experience, we would also describe such a simple transformation as magic!</span></p>
<p><span style="font-weight: 400;">Through this blogpost, I will illustrate why with a few counterexamples.</span></p>
<p>&nbsp;</p>
<h2><span style="font-weight: 400;">Generating the output</span></h2>
<h3><span style="font-weight: 400;">Presenting complex information</span></h3>
<p><span style="font-weight: 400;">Within a chatbot, text information can be enriched with images, hyperlinks, slideshows, etc. Some use cases such as navigation assistance or purchase recommendations would seem impossible to implement without those tools.</span></p>
<p><span style="font-weight: 400;">In other cases, several voice interactions would be required to reach the same result as a single visual output. For example, here is my best shot at transforming the output of a appointment scheduling chatbot for a voicebot: </span></p>
<p><span style="font-weight: 400;"><img decoding="async" class="aligncenter wp-image-9488 size-full" src="https://www.nuecho.com/wp-content/uploads/2022/09/rdv-c-en.png" alt="" width="358" height="522" srcset="https://www.nuecho.com/wp-content/uploads/2022/09/rdv-c-en.png 358w, https://www.nuecho.com/wp-content/uploads/2022/09/rdv-c-en-206x300.png 206w" sizes="(max-width: 358px) 100vw, 358px" /></span></p>
<p><img decoding="async" class="aligncenter wp-image-9490 size-full" src="https://www.nuecho.com/wp-content/uploads/2022/09/rdv-v-en.png" alt="" width="358" height="393" srcset="https://www.nuecho.com/wp-content/uploads/2022/09/rdv-v-en.png 358w, https://www.nuecho.com/wp-content/uploads/2022/09/rdv-v-en-273x300.png 273w" sizes="(max-width: 358px) 100vw, 358px" /></p>
<h3></h3>
<p>&nbsp;</p>
<h3><span style="font-weight: 400;">Trail of previous interactions</span></h3>
<p><span style="font-weight: 400;">What does a chatbot do if the user is not paying attention, has poor memory, or has forgotten to put on their glasses? Nothing! The output remains there for the user to re-read as they see fit, which makes certain cases that are absolutely necessary in a verbal interaction become completely useless in a written conversation:</span></p>
<p><img decoding="async" class="aligncenter wp-image-9492 size-full" src="https://www.nuecho.com/wp-content/uploads/2022/09/repeter-en.png" alt="" width="358" height="397" srcset="https://www.nuecho.com/wp-content/uploads/2022/09/repeter-en.png 358w, https://www.nuecho.com/wp-content/uploads/2022/09/repeter-en-271x300.png 271w" sizes="(max-width: 358px) 100vw, 358px" /></p>
<p>&nbsp;</p>
<h3><span style="font-weight: 400;">Persona and voice features</span></h3>
<p><span style="font-weight: 400;">The persona (demographics, language level, personality) of the virtual agent, as well as its consistency, are important in both modes. While in text mode you have to think about the visual output of the chatbot, in voice mode, you have to look for a voice that represents the desired characteristics, while being natural, and this can limit our options. For example, trying to create an informal voice agent can be near impossible, especially when using TTS instead of a recorded voice (which also has its limitations).</span></p>
<audio class="wp-audio-shortcode" id="audio-9487-1" preload="none" style="width: 100%;" controls="controls"><source type="audio/wav" src="https://www.nuecho.com/wp-content/uploads/2022/09/voicebot_cool-en.wav?_=1" /><a href="https://www.nuecho.com/wp-content/uploads/2022/09/voicebot_cool-en.wav">https://www.nuecho.com/wp-content/uploads/2022/09/voicebot_cool-en.wav</a></audio>
<p>&nbsp;</p>
<p>&nbsp;</p>
<h3><span style="font-weight: 400;">Support of multiple channels</span></h3>
<p><span style="font-weight: 400;">Finally, even if our use cases are channel-agnostic, our personna very simple and our agent very talkative, it is clear that we must at least be able to play different messages depending on the channel so that SSML is included in audio messages. Unfortunately, some dialog engines hardly support multiple channels and this can greatly increase the challenges of implementing a common agent for both voice and text.</span></p>
<p><img decoding="async" class="aligncenter wp-image-9496 size-full" src="https://www.nuecho.com/wp-content/uploads/2022/09/ssml-en.png" alt="" width="358" height="364" srcset="https://www.nuecho.com/wp-content/uploads/2022/09/ssml-en.png 358w, https://www.nuecho.com/wp-content/uploads/2022/09/ssml-en-295x300.png 295w" sizes="(max-width: 358px) 100vw, 358px" /></p>
<h2></h2>
<p>&nbsp;</p>
<h2><span style="font-weight: 400;">Input interpretation</span></h2>
<p><span style="font-weight: 400;">“What about the other way around? The user won’t send images or carousels of images to the chatbot. For sure, interpreting the input can’t be that different.” I will answer with a dramatic example. Let’s look at Bob who is trying to express what he needs to a vocal agent:</span></p>
<p><img decoding="async" class="aligncenter wp-image-9498 size-full" src="https://www.nuecho.com/wp-content/uploads/2022/09/bob-en.png" alt="" width="677" height="764" srcset="https://www.nuecho.com/wp-content/uploads/2022/09/bob-en.png 677w, https://www.nuecho.com/wp-content/uploads/2022/09/bob-en-480x542.png 480w" sizes="(min-width: 0px) and (max-width: 480px) 480px, (min-width: 481px) 677px, 100vw" /></p>
<p>&nbsp;</p>
<p><span style="font-weight: 400;">Of course, Bob and his legendary bad luck are not real, but the cases I have presented are taken from real-life examples. Even though some STT models can now ignore “mhms”, noises and secondary voices, the transcription will still have its share of errors.</span></p>
<p>&nbsp;</p>
<h3><span style="font-weight: 400;">Uncertainty</span></h3>
<p><span style="font-weight: 400;">There are ways to reduce these errors or their impacts, whether it’s through the configuration of the engine, systematic changes of the transcription, or the adaptation of the NLP model to the sentences received. There remains, however, an additional uncertainty related to the STT which must be taken into account in the development of a voice application.</span></p>
<p>&nbsp;</p>
<h4><span style="font-weight: 400;">Strategies for dealing with uncertainty</span></h4>
<p><span style="font-weight: 400;">To increase our confidence in the interpretation of the input, we will use more strategies for dealing with uncertainty in the dialogue of a vocal agent than in the dialogue of a textual agent. </span></p>
<p><span style="font-weight: 400;">For example, we can think of:</span></p>
<ul>
<li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Add a step to explicitly or implicity confirm an intent or an entity</span></li>
<li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Add a step to disambiguate the input when intentions are too similar</span></li>
<li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Enable changes or fixes</span></li>
</ul>
<p>&nbsp;</p>
<p><img decoding="async" class="aligncenter wp-image-9501 size-full" src="https://www.nuecho.com/wp-content/uploads/2022/09/confirm-en.png" alt="" width="358" height="393" srcset="https://www.nuecho.com/wp-content/uploads/2022/09/confirm-en.png 358w, https://www.nuecho.com/wp-content/uploads/2022/09/confirm-en-273x300.png 273w" sizes="(max-width: 358px) 100vw, 358px" /></p>
<h4></h4>
<p>&nbsp;</p>
<h4><span style="font-weight: 400;">Choosing use cases</span></h4>
<p><span style="font-weight: 400;">Addresses, emails or people’s names are difficult pieces of information to transcribe correctly for many reasons, but they present lesser challenges in writing. If some of these pieces of information are critical for a use case, it could be very complex, risky, or inappropriate for the user experience to implement it though a vocal agent..</span></p>
<p>&nbsp;</p>
<p><img decoding="async" class="aligncenter wp-image-9503 size-full" src="https://www.nuecho.com/wp-content/uploads/2022/09/courriel-en.png" alt="" width="358" height="358" srcset="https://www.nuecho.com/wp-content/uploads/2022/09/courriel-en.png 358w, https://www.nuecho.com/wp-content/uploads/2022/09/courriel-en-300x300.png 300w, https://www.nuecho.com/wp-content/uploads/2022/09/courriel-en-150x150.png 150w" sizes="(max-width: 358px) 100vw, 358px" /></p>
<h2></h2>
<p>&nbsp;</p>
<h2><span style="font-weight: 400;">Real-time management</span></h2>
<p><span style="font-weight: 400;">The last big difference between voice and text conversations is time management. A text conversation is asynchronous: the input is received in one block, and the response that follows is sent in one block. The audio, on the other hand, is transmitted continuously, so the time must be managed accordingly.</span></p>
<p>&nbsp;</p>
<h3><span style="font-weight: 400;">Short response time and user experience</span></h3>
<p><span style="font-weight: 400;">In a vocal conversation, we are expecting a response within a few tenths of a second, while in text mode, it is completely normal to wait for much longer. Long silences on the phone are uncomfortable, and even if it is possible to play sounds or music-on-hold, between two interactions, the “&#8230;” hint cannot be replaced. It is therefore much more critical to ensure that the system is fast and to warn the user in case of a longer operation in voice mode.</span></p>
<h3></h3>
<p>&nbsp;</p>
<h3><span style="font-weight: 400;">Interruptions</span></h3>
<p><span style="font-weight: 400;">Because voice output has a duration, the user can try to interrupt a voice agent. Supporting interruption correctly involves additional technical complexity, but also has additional impact on the dialogue. For example, we want to make the assumption that if the user says “yes” when presenting several options, this means that he chooses the first one, and we will support this case.</span></p>
<p>&nbsp;</p>
<p><img decoding="async" class="wp-image-9505 size-full aligncenter" src="https://www.nuecho.com/wp-content/uploads/2022/09/confirm-en-1.png" alt="" width="358" height="393" srcset="https://www.nuecho.com/wp-content/uploads/2022/09/confirm-en-1.png 358w, https://www.nuecho.com/wp-content/uploads/2022/09/confirm-en-1-273x300.png 273w" sizes="(max-width: 358px) 100vw, 358px" /></p>
<p>&nbsp;</p>
<h3><span style="font-weight: 400;">User Silence</span></h3>
<p><span style="font-weight: 400;">Although a virtual agent isn’t discomforted by silences, the treatment of what is commonly called a </span><i><span style="font-weight: 400;">no-input</span></i><span style="font-weight: 400;"> differs greatly depending on the mode of communication. In a voice conversation, a few seconds of silence usually means the user is hesitating or their voice is too low; an appropriate help message will therefore be played.</span></p>
<p><span style="font-weight: 400;">In text mode, it is useless to harass the user with error messages because the absence of input is treated like any inaction on a website: after a determined time, the user will be disconnected if necessary, and the conversation is ended.</span></p>
<p>&nbsp;</p>
<p><img decoding="async" class="size-full wp-image-9507 aligncenter" src="https://www.nuecho.com/wp-content/uploads/2022/09/no-input-en.png" alt="" width="358" height="377" srcset="https://www.nuecho.com/wp-content/uploads/2022/09/no-input-en.png 358w, https://www.nuecho.com/wp-content/uploads/2022/09/no-input-en-285x300.png 285w" sizes="(max-width: 358px) 100vw, 358px" /></p>
<h2></h2>
<p>&nbsp;</p>
<h2><span style="font-weight: 400;">So, finally…</span></h2>
<p>&nbsp;</p>
<p><span style="font-weight: 400;">How then does one answer the question: “What can be reused from a voice agent to create a chatbot or vice versa?” The answer is very nuanced and a little disappointing. Switching from a voice agent to a chatbot will generally allow more reuse because the former is generally more restrictive: perhaps it will be enough to adapt the messages a little, to add or remove a few dialogue paths.</span></p>
<p>&nbsp;</p>
<p><span style="font-weight: 400;">However, in both cases, it is important to take a step back and re-evaluate our use cases and our persona: are they appropriate, feasible and realistic on this new channel? For what comes out of this questioning, business rules and high-level flows of the dialogue can probably be reused. The NLU model (textual data, organization of intentions and entities) and the messages of one may serve as a basis for the other, but will be subject to change. Indeed, the approach will have to be adapted to the results of user tests and data collection, so that the user experience does not suffer in favor of the simplicity of development.</span></p>
<p>&nbsp;</p>
<p>&nbsp;</p><p>The post <a href="https://www.nuecho.com/voice-agents-vs-chatbots-where-does-the-difference-lie/">Voice agents VS. Chatbots: Where does the difference lie?</a> first appeared on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
]]></content:encoded>
					
		
		<enclosure url="https://www.nuecho.com/wp-content/uploads/2022/09/voicebot_cool-en.wav" length="0" type="audio/wav" />

			</item>
		<item>
		<title>Chloe: the Evolution, or Building a Covid-19 Chatbot with Rasa &#8211; Part 2</title>
		<link>https://www.nuecho.com/chatbot-rasa-artificial-intelligence-covid-19-2/#utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=chatbot-rasa-artificial-intelligence-covid-19-2</link>
		
		<dc:creator><![CDATA[Karine Dery]]></dc:creator>
		<pubDate>Wed, 09 Sep 2020 19:28:37 +0000</pubDate>
				<category><![CDATA[Blog]]></category>
		<guid isPermaLink="false">https://zux.zsm.mybluehost.me/majoctobre2019/?p=7010</guid>

					<description><![CDATA[<p>Episode 1: NLU and Error Handling Flashback &#8211; Scene 4: Question Answering and Following TED As I mentioned in the first post, the goal with the Q&#38;A flow was that the user could ask a question about Covid-19 and we would display the answer returned by Mila’s model API. There have been multiple versions of [&#8230;]</p>
<p>The post <a href="https://www.nuecho.com/chatbot-rasa-artificial-intelligence-covid-19-2/">Chloe: the Evolution, or Building a Covid-19 Chatbot with Rasa – Part 2</a> first appeared on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
]]></description>
										<content:encoded><![CDATA[<h2>Episode 1: NLU and Error Handling</h2>
<h3><img decoding="async" class="wp-image-6914 alignnone size-full" src="https://www.nuecho.com/wp-content/uploads//2020/09/disclaimer.jpeg" alt="" width="1947" height="257" srcset="https://www.nuecho.com/wp-content/uploads/2020/09/disclaimer.jpeg 1947w, https://www.nuecho.com/wp-content/uploads/2020/09/disclaimer-300x40.jpeg 300w, https://www.nuecho.com/wp-content/uploads/2020/09/disclaimer-1024x135.jpeg 1024w, https://www.nuecho.com/wp-content/uploads/2020/09/disclaimer-768x101.jpeg 768w, https://www.nuecho.com/wp-content/uploads/2020/09/disclaimer-1536x203.jpeg 1536w, https://www.nuecho.com/wp-content/uploads/2020/09/disclaimer-1080x143.jpeg 1080w, https://www.nuecho.com/wp-content/uploads/2020/09/disclaimer-1280x169.jpeg 1280w, https://www.nuecho.com/wp-content/uploads/2020/09/disclaimer-980x129.jpeg 980w, https://www.nuecho.com/wp-content/uploads/2020/09/disclaimer-480x63.jpeg 480w" sizes="(max-width: 1947px) 100vw, 1947px" /></h3>
<h3>Flashback &#8211; Scene 4: Question Answering and Following TED</h3>
<p>As I mentioned in the first post, the goal with the Q&amp;A flow was that the user could ask a question about Covid-19 and we would display the answer returned by Mila’s model API. There have been multiple versions of this portion of the application, from very basic to quite complex, and it was integrated in more and more places in the dialogue.</p>
<p>In the first version of the question-answering flow, the user had to choose the “Ask question” option in the main menu or after an assessment, and then we collected the question. We planned for four possible outcomes of the question-answering API call (the fourth was still not implemented in the model when our participation in the project ended):</p>
<ul>
<li>Failure: API call failed</li>
<li>Success: API call succeeded and the model provided an answer</li>
<li>Out of distribution (OOD): API call succeeded but the model provided no answer</li>
<li>Need assessment: API call succeeded but the user should assess their symptoms to get their answer</li>
</ul>
<p>If the outcome was a success, Chloe would ask an additional question to know if the answer was useful (the <a target="_blank" href="https://github.com/botfront/rasa-webchat" rel="noopener">chat widget</a> we used did not provide any kind of thumbs-up/thumbs-down UI to easily skip this interaction). If the outcome was OOD, the user was asked to reformulate.</p>
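<p>To illustrate, here is a rough sketch of how a custom action calling the question-answering API could map the response to these four outcomes. The endpoint and response fields are hypothetical, since the details of the real API are not shown here.</p>
<pre>
import requests

# Hypothetical endpoint and response fields; the real model API is not shown here.
QA_API_URL = "https://covidfaq.example.org/answer"


def call_question_answering(question):
    """Map the question-answering API response to the four outcomes used by the dialogue."""
    try:
        response = requests.post(QA_API_URL, json={"question": question}, timeout=5)
        response.raise_for_status()
    except requests.RequestException:
        return {"status": "failure", "answer": None}

    body = response.json()
    if body.get("needs_assessment"):
        return {"status": "need_assessment", "answer": None}
    if not body.get("answers"):
        return {"status": "out_of_distribution", "answer": None}
    return {"status": "success", "answer": body["answers"][0]}
</pre>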
<p>Collecting the question, the reformulation, and the feedback would be done in a form, but where and how to implement the transitions to the other flows was not so clear. There were 6 different transitions after the Q&amp;A flow depending on the outcome and the presence of the “self_assess_done” slot we described earlier. We vaguely thought about asking the user what they wanted to do next inside the form, to keep Q&amp;A flow logic centralized, but the idea was discarded as we came up with no clean way to implement it, and that’s why we ended up relying on stories and the TED policy to predict the deterministic transitions.</p>
<p>We were also confronted with the problem that some of these transitions were “affirm/deny” questions, “affirm” either leading to an assessment or to asking another question. At this point, our basic assessment stories started directly with “get_assessment”, as a shortcut for memoization, and starting a story with “get_assessment OR affirm” would obviously lead to unwanted matches. We worked around this inconvenience with a solution that only worked because we controlled the user input through buttons. Like this:</p>
<p><em>An intent shortcut with buttons</em></p>
<p><img decoding="async" class="aligncenter wp-image-7028 size-full" src="https://www.nuecho.com/wp-content/uploads/2020/09/1-1.bmp" alt="" width="467" height="173" /></p>
<p>This way, we did not have to add stories of assessments following a question-answering, but in hindsight we should have done it then, since adding stories with Q&amp;A following an assessment worked (mostly) well, and we had to do it anyway when adding NLU.</p>
<h3>Flashback &#8211; Scene 5.25: Daily Check-In Enrollment Detours</h3>
<p>The daily check-in enrollment flow design had been augmented and included unhappy paths. These had to be addressed since the phone number and validation code (added in this version) were collected directly from user text. These are the cases we addressed:</p>
<ul>
<li>Phone number is invalid</li>
<li>User says they don’t have a phone number</li>
<li>User wants to cancel because they don’t want to give their phone number</li>
<li>Validation code is invalid</li>
<li>User did not receive the validation code and wants a new one sent</li>
<li>User did not receive the validation code and wants to change the phone number</li>
</ul>
<p>Some of these are something between a digression and error-handling, and we thought of implementing them as “controlled” digressions, as we had already done for the pre-conditions explanations, which went as follows:</p>
<p><img decoding="async" class="aligncenter wp-image-7030 size-full" src="https://www.nuecho.com/wp-content/uploads/2020/09/digression-en.gif" alt="" width="600" height="675" /></p>
<p>But since most of them involved error counters, error messages, or more complexity, we decided to manage them all inside the form instead of separating the logic between forms, stories and intent mappings. It did have downsides (other than the hundreds of lines of additional code): some logic happened over multiple interactions and we had to add many slots for counters and flags to keep track of the progression (our final version of the form uses ten such slots).</p>
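<p>To give an idea of the counter-and-flag pattern, here is a much-simplified sketch in the style of a Rasa 1.x form action, with one unhappy path (an invalid validation code) tracked through a counter slot. Slot, form and response names are invented; the real form handled many more cases.</p>
<pre>
from rasa_sdk.forms import FormAction


class DailyCiEnrollForm(FormAction):
    """Much-simplified sketch (Rasa 1.x style, names invented) of one unhappy
    path handled inside the enrollment form, using a counter slot."""

    def name(self):
        return "daily_ci_enroll_form"

    @staticmethod
    def required_slots(tracker):
        return ["phone_number", "validation_code"]

    def slot_mappings(self):
        return {"phone_number": self.from_text(), "validation_code": self.from_text()}

    def validate_validation_code(self, value, dispatcher, tracker, domain):
        expected = tracker.get_slot("expected_validation_code")
        if value and value.strip() == expected:
            return {"validation_code": value.strip(), "code_error_count": 0.0}

        # Wrong code: bump a counter slot so the form (or the stories after it)
        # can offer to resend the code or change the phone number.
        errors = (tracker.get_slot("code_error_count") or 0.0) + 1.0
        dispatcher.utter_message(template="utter_invalid_validation_code")
        return {"validation_code": None, "code_error_count": errors}

    def submit(self, dispatcher, tracker, domain):
        dispatcher.utter_message(template="utter_daily_ci_enrolled")
        return []
</pre>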
<h3>Flashback &#8211; 5.75: Stopping for a Special Class for TED</h3>
<p>After some tests, it was brought to our attention that the user could get stuck in a Q&amp;A OOD loop since we only gave the option to reformulate. The design was changed so that the user could either retry or exit Q&amp;A, and we added 2 more transitions for this case.</p>
<p>Adding these, we hit a thin wall: the <a target="_blank" href="https://rasa.com/docs/rasa/core/policies/#ted-policy" rel="noopener">TED policy</a> did not learn the correct behaviour after Q&amp;As: it mixed up the impacts of the “question_answering_status” and “symptoms” slots. Re-distributing the Q&amp;A examples equally between assessments with no, mild or moderate symptoms was tedious clerical work, but it worked, and in the end, the policy predicted the correct behaviour on conversations that were not in our stories.</p>
<h3>Scene 6: Implementing Testing Sites Navigation on Autopilot</h3>
<p>After wrestling with Q&amp;A transitions and daily-ci enrollment error-handling, testing sites navigation brought no new challenges. The flow consisted of three major steps:</p>
<ol>
<li>Explain how it works and ask if the user wants to continue</li>
<li>If so, collect their postal code and validate its format and existence, cancelling after too many errors</li>
<li>Display the resulting testing sites or offer a second try if there were none.</li>
</ol>
<p>Coherently with our previous implementations, we used a form to collect the postal code and handle errors, make the API calls and offer the second try, and stories to display the explanations and transition to other flows. The transitions, again, varied depending on the API call outcome and on the “self_assess_done” slot.</p>
<h3>Scene 7: Exploring the Sinuous Path of NLU</h3>
<p>When we finally got to the end of the feature list the fast buttons-no-error-handling way, we could explore integrating NLU and handling unhappy paths. We started with the first input/main menu as a test. Anything that was not part of the options would be sent to the Q&amp;A API, but due to the non-contextual NLU in Rasa, and the fact that we expected a large variety of questions for the Q&amp;A, this “anything” could be any intent, with any score. “How will we handle all these intents?” was not as trivial as it might seem.</p>
<h4>Option 1: Add examples in stories</h4>
<p>The straightforward path was to add stories with unsupported intents and the error behavior (directing to the Q&amp;A form), but how many examples would it take? The TED policy could not be expected to learn to use these error examples as catch-alls, and using ORs to include all unsupported intents would have multiplied the training time exponentially as soon as we applied this approach to other cases. This path was a dead-end.</p>
<h4>Option 2: Core fallback</h4>
<p>If we did not include the unsupported intents in stories, the TED policy would still predict something, but we could hope for the confidence score to be low, and set a threshold to trigger a fallback action. The action would replace the intent by “fallback”, and we could manage this one intent in the stories. But our expected behaviors did not all have very good scores, some not so far from what a misplaced “affirm” could get, since it was in many stories. Thus we did not want to depend on a threshold to trigger the fallback.</p>
<h4>The solution: Unsupported intent policy</h4>
<p>We ended up using the “fallback” intent idea, but with a deterministic policy. The policy predicted the action to replace the intent if the latest relevant action before the user input was the main-menu question and the intent was not in the list of supported ones. Stories and memorization were used to trigger the Q&amp;A form and manage the peculiar transitions after it (call failure and OOD were followed by a main menu error message instead of the regular messages). To achieve this, the Q&amp;A form was modified to pre-fill the question slot with the last user input if the intent was “fallback”:</p>
<p><em>Using the trigger message in the question answering form</em></p>
<p><img decoding="async" class="aligncenter wp-image-7032 size-full" src="https://www.nuecho.com/wp-content/uploads/2020/09/3-1.bmp" alt="" width="571" height="446" /></p>
<h3>Scene 8: Further Explorations</h3>
<p>As a second step, we added NLU in yes-no questions, which, per design, simply triggered a reformulated question with buttons and no text field. The majority of those were in forms, some with exceptions to the “utter_ask_{slot_name}” message convention. The exceptions also applied to the error messages, thus a generic approach that wouldn’t even apply to all cases seemed too complicated for the benefits of it, and we did not spend time thinking about one. It seemed simpler and faster to just manage everything in the forms like this:</p>
<p><img decoding="async" class="aligncenter wp-image-7034 size-full" src="https://www.nuecho.com/wp-content/uploads/2020/09/4.bmp" alt="" width="576" height="633" /></p>
<h3>Intermission: Losing the Feedback Phantom Trailing Behind Us</h3>
<p>Adding NLU, and consequently flexibility, we were reminded of the mandatory and cumbersome feedback interaction that still haunted us, and decided to make it more flexible, too. We still didn’t have a feedback widget or time to implement one, so we kept the question, but adapted the reaction: if the user answered something other than “affirm/deny”, it would be treated as if they were already in the next question, which offered to ask another question and could lead to the other functionalities. This required a bit of gymnastics to preemptively exit the form and “reproduce” the user input:</p>
<p><img decoding="async" class="aligncenter wp-image-7036 size-full" src="https://www.nuecho.com/wp-content/uploads/2020/09/5.bmp" alt="" width="576" height="682" /></p>
<h3>Scene 9: Final Sprint to Add NLU</h3>
<p style="text-align: left;">Since we already had the policy to replace intents with “fallback”, error-handling outside forms was mostly a matter of adding entries to the dictionary of latest action-supported intents, and stories to react to the “fallback” intent, either by entering the Q&amp;A form or displaying an error message to follow the design. Inside forms, we applied the same approach as for yes-no questions. We were forced to make some collateral changes, like adding a province entity, or adding stories (mostly ORs though) to manage the transitions where “affirm” or “deny” were valid (now that the buttons shortcut was unavailable). We also had to backtrack on our cleanly handled pre-conditions digression since the simple <a target="_blank" href="https://rasa.com/docs/rasa/core/policies/#mapping-policy" rel="noopener">mapping policy</a> solution could not apply with error-handling, and managed it inside the form like everything else.</p>
<p style="text-align: center;"><strong>The end</strong></p>
<p>Looking back, even though we added NLU, it seems like we took a lot of shortcuts, a lot of not-so-rasa-esque approaches. Our use case, completely predictable, with no random navigation, full of exceptions and tiny variations, did not correspond to a typical Rasa use case. We wrestled with lots of obstacles that come naturally when trying to implement a boxes-and-arrows design with Rasa. But Rasa offers flexibility through code and possible additions, and in the end, we often chose code to represent dialogue patterns because when time is short, the road we know is the safest way to end up where we want.</p>
<p>In a further installment, we will dive deeper into two features of a boxes-and-arrows design that are hard to implement with Rasa, i.e. decision trees and dialogue modularity, and the various methods of implementing them. We will also explore if and how Rasa 2.0, still in the alpha stage at the moment of writing, can make this task easier.</p><p>The post <a href="https://www.nuecho.com/chatbot-rasa-artificial-intelligence-covid-19-2/">Chloe: the Evolution, or Building a Covid-19 Chatbot with Rasa – Part 2</a> first appeared on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Chloe: the Evolution, or Building a Covid-19 Chatbot with Rasa</title>
		<link>https://www.nuecho.com/chatbot-rasa-artificial-intelligence-covid-19/#utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=chatbot-rasa-artificial-intelligence-covid-19</link>
		
		<dc:creator><![CDATA[Karine Dery]]></dc:creator>
		<pubDate>Tue, 08 Sep 2020 18:00:00 +0000</pubDate>
				<category><![CDATA[Blog]]></category>
		<guid isPermaLink="false">https://zux.zsm.mybluehost.me/majoctobre2019/?p=6882</guid>

					<description><![CDATA[<p>Context When the confinement measures started in Canada, we were contacted by Dialogue, a telemedicine provider, to help them migrate Chloe, their Covid-19 rule-based chatbot, to a conversational chatbot, using Rasa, and add new functionalities to the bot. This would be a 10 weeks, agile, iterative project. Here are Chloe’s high-level functionalities: Self-assessment: provide personalized [&#8230;]</p>
<p>The post <a href="https://www.nuecho.com/chatbot-rasa-artificial-intelligence-covid-19/">Chloe: the Evolution, or Building a Covid-19 Chatbot with Rasa</a> first appeared on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
]]></description>
										<content:encoded><![CDATA[<h2>Context</h2>
<p><span style="font-size: x-large;">When the confinement measures started in Canada, we were contacted by <a target="_blank" href="https://www.dialogue.co/en/" rel="noopener">Dialogue</a>, a telemedicine provider, to help them migrate Chloe, their Covid-19 rule-based chatbot, to a conversational chatbot, using <a target="_blank" href="https://rasa.com/" rel="noopener">Rasa</a>, and add new functionalities to the bot. This would be a 10 weeks, agile, iterative project.</span></p>
<p>Here are Chloe’s high-level functionalities:</p>
<ul>
<li>Self-assessment: provide personalized recommendations based on one’s symptoms and following federal and provincial recommendations</li>
<li>Question-Answering (Q&amp;A): allow the user to ask Covid-19 related questions using a <a target="_blank" href="https://github.com/dialoguemd/covidfaq" rel="noopener">model</a> developed by <a target="_blank" href="https://mila.quebec/en/" rel="noopener">Mila</a></li>
<li>Daily check-in: help the users monitor their symptoms day by day. If the user subscribed to the daily check-in, they receive a link by SMS once a day to connect to Chloe and assess the progression of their symptoms</li>
<li>Screening/testing sites navigation: using their postal code, provide the user with a list of testing sites near them through <a target="_blank" href="https://clinia.com/en-ca/product/places/covid-places" rel="noopener">Clinia’s API</a></li>
</ul>
<p>The design was made iteratively as we added functionalities and was adjusted to account for comments from Dialogue’s doctors and volunteer testers; it was constantly moving. The implementation followed not far behind. There are many possible ways to implement some dialogue patterns with Rasa, and with the ever-changing designs, our implementation choices often felt like navigating a labyrinth. We made it to the exit, but not without turning back a couple of times before hitting a wall or to avoid a cliff. Without an overview of the final design, we ended up with some inconsistent implementations, and without infinite time, some incomplete refactors. But this context also led us to explore paths we wouldn’t have taken with Rasa if we’d had the time to identify patterns and create generic components to apply them.</p>
<p>In this post and the next one of this short series, I will tell the tale of how we developed Chloe. For each step of our course, I will describe the main obstacles we faced, and the implementation decisions we made, often in the heat of the moment. In this first installment, we will mostly focus on the self-assessment and daily check-in flows.</p>
<p>&nbsp;</p>
<h3><span style="font-size: 34px; font-family: 'Roboto Slab', Georgia, 'Times New Roman', serif;">Episode 1: Assessment Flows</span></h3>
<p><img decoding="async" class="wp-image-6914 alignnone size-full" src="https://www.nuecho.com/wp-content/uploads//2020/09/disclaimer.jpeg" alt="" width="1947" height="257" srcset="https://www.nuecho.com/wp-content/uploads/2020/09/disclaimer.jpeg 1947w, https://www.nuecho.com/wp-content/uploads/2020/09/disclaimer-300x40.jpeg 300w, https://www.nuecho.com/wp-content/uploads/2020/09/disclaimer-1024x135.jpeg 1024w, https://www.nuecho.com/wp-content/uploads/2020/09/disclaimer-768x101.jpeg 768w, https://www.nuecho.com/wp-content/uploads/2020/09/disclaimer-1536x203.jpeg 1536w, https://www.nuecho.com/wp-content/uploads/2020/09/disclaimer-1080x143.jpeg 1080w, https://www.nuecho.com/wp-content/uploads/2020/09/disclaimer-1280x169.jpeg 1280w, https://www.nuecho.com/wp-content/uploads/2020/09/disclaimer-980x129.jpeg 980w, https://www.nuecho.com/wp-content/uploads/2020/09/disclaimer-480x63.jpeg 480w" sizes="(max-width: 1947px) 100vw, 1947px" /></p>
<h3><span style="font-family: 'Roboto Slab', Georgia, 'Times New Roman', serif; font-size: 29px;">Scene 1: Sprint to the First Assessment Demo</span></h3>
<p><span style="font-size: x-large;">Very early in the project (i.e. day 8), we were asked if we could demo an assessment flow at the end of the day. When the request for a demo came, the first design was still boiling in our designer’s head; we had a running Rasa project but no dialogue implemented. Nonetheless, we rolled up our sleeves and made it happen.</span></p>
<p>The initial demo was a simple decision tree to find out the gravity of the user’s symptoms and make appropriate recommendations. We took the straightest path forward: <a target="_blank" href="https://rasa.com/docs/rasa/core/policies/#augmented-memoization-policy" rel="noopener">augmented memoization policy</a> with a <a target="_blank" href="https://rasa.com/docs/rasa/core/stories/" rel="noopener">story</a> representing each possible path. We used buttons and blocked the text input field so we wouldn’t need to train an NLU model or handle unhappy paths.</p>
<h3><span style="font-family: 'Roboto Slab', Georgia, 'Times New Roman', serif; font-size: 29px;">Scene 2: The Path Becomes Muddy as Assessment Flows Multiply</span></h3>
<p><span style="font-size: x-large;">The next major increment in the flows was to add a distinction, at the entry point of the self-assessment flow, between three situations:</span></p>
<ul>
<li>The user thinks they might be sick and wants to assess their symptoms (initial case)</li>
<li>The user has been tested positive and wants to assess their symptoms and get advice</li>
<li>The user has done the self-assessment before and comes back to reassess their symptoms</li>
</ul>
<p>This distinction created several variations in the basic flow, such as asking if the symptoms worsened in the case of a reassessment, or starting with self-isolation recommendations for someone who tested positive.</p>
<p>We continued on the same path, adding stories to implement these two new flows, although we started noticing the quickly growing number of stories for only three flows (which would continue to grow in complexity) and the repetitions between similar paths.</p>
<p><em>Two similar stories for a user with mild symptoms</em></p>
<p><a target="_blank" href="https://www.nuecho.com/wp-content/uploads//2020/09/2.bmp" rel="noopener"><img decoding="async" class="wp-image-6893 size-full aligncenter" src="https://www.nuecho.com/wp-content/uploads//2020/09/2.bmp" alt="" width="576" height="401" /></a></p>
<p>Seeing no simple solution to this in stories &#8211; <a target="_blank" href="https://rasa.com/docs/rasa/core/stories/#checkpoints-and-or-statements" rel="noopener">checkpoints and ORs</a> could not help because the similar parts are sandwiched between the different intents and the variations they create &#8211; we didn’t make any significant implementation change at this point.</p>
<p>While implementing those three flows, we needed to add a change that applied to all three: after the user says they don’t have severe symptoms, Chloe collects their province of residence and age, to make more precise recommendations. This time, the straightforward path was to put these pieces in a form: we collect information, and it is easily reusable in all our stories.</p>
<h3><span style="font-family: 'Roboto Slab', Georgia, 'Times New Roman', serif; font-size: 29px;">Scene 3: Fast Lane to Daily Check-In Enrollment</span></h3>
<p><span style="font-size: x-large;">Moving on to another functionality, we implemented the daily check-in enrollment; if the user shows symptoms, Chloe offers the daily check-in. If the user accepts, she collects their name and phone number, notes if they have preconditions that make them more susceptible to complications, etc. This flow was also without a doubt a form. In this first simple version, even though we used free text to collect the first name and phone number, there was no real error-handling: we used the complete user input for the name and extracted all the digits of the input for the phone number, re-asking for it if there were not 10 or 11 digits.</span></p>
<h3><span style="font-family: 'Roboto Slab', Georgia, 'Times New Roman', serif; font-size: 29px;">Scene 4: Answering Questions and Following TED</span></h3>
<p><span style="font-size: x-large;">The Q&amp;A functionality is meant to allow the user to ask any question they have about Covid-19, and send that question to the module developed by Mila, receive a response and display it to the user. We wanted to make this feature available in every flow, with different paths leading to it and different paths leading out of it depending on the type of outcome (the different types of outcome will be described, as well as the details of this feature and its evolution, in the next installment of this post).</span></p>
<p>Since Chloe would not offer an assessment if any type of assessment had already been done in the conversation, the transitions after the Q&amp;A also depended on this fact, multiplying the outward paths. Memoization wouldn’t suffice to learn this difference since we could loop in Q&amp;As over and over. Thus, we added a “self_assess_done” featurized slot combined with assessment+Q&amp;A stories and counted on the <a target="_blank" href="https://rasa.com/docs/rasa/core/policies/#ted-policy" rel="noopener">TED policy</a> to learn with few examples. It worked, but our stories file grew a lot suddenly.</p>
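<p>For illustration, a small custom action along these lines can set the “self_assess_done” slot at the end of an assessment (the action name is invented); the slot then has to be featurized in the domain for the TED policy to take it into account.</p>
<pre>
from rasa_sdk import Action
from rasa_sdk.events import SlotSet


class ActionFinishAssessment(Action):
    """Sketch (action name invented): mark that an assessment was completed so
    that later transitions, e.g. after a question-answering loop, can depend on it."""

    def name(self):
        return "action_finish_assessment"

    def run(self, dispatcher, tracker, domain):
        return [SlotSet("self_assess_done", True)]
</pre>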
<h3><span style="font-family: 'Roboto Slab', Georgia, 'Times New Roman', serif; font-size: 29px;">Intermission: Backtracking to Forms to Avoid a Stories Jungle</span></h3>
<p><span style="font-size: x-large;">Foreseeing this coming multiplication and lengthening of our stories, we decided to transfer the common part of the assessments to a form before completely integrating the Q&amp;As. This would shorten and simplify stories, but also ease the collection of slots (presence of cough or fever, newly added as a separated question, and the degree of symptoms), necessary if the user subscribed to daily check-in. A form meant less repetition but also using intermediary disposable slots to allow the tree-like filling of a single slot to cover the degree of symptoms. The slot was featurized to adapt the daily check-in offer and recommendations after the form in stories.</span></p>
<p><span style="font-size: x-large;">But this unique assessment form did not last long; the design changed while we were looking away. Two recommendation messages, about self-isolation and home assistance, were replaced by small flows with one question each. The design and implementation around these moved a lot. First, both flows were forms, inserted where the corresponding message was. Then we had to triplicate the assessment form to insert the self-isolation subflow, either before, after or in the middle of it depending on the situation (regular assessment, tested positive or reassessing). Later, the self-isolation flow was moved and modified for each situation, but we kept three separate forms to gradually include the specific questions that were left out of the common version. We kept code in common, but the “how” varied over time, and more details will be given on this subject as the question of modularity will be touched in another post.</span></p>
<p><span style="font-size: x-large;">At this point, our general model used a combination of stories, forms and actions that can be summarized as follows:</span></p>
<ul>
<li>Stories: transition between main flows and subflows, define the sequences of forms, conditions and actions that are possible for each main functionality and high-level flow</li>
<li>Forms: collect pieces of information and define decision trees, handle reusable subdialogues that include at least one question, etc.</li>
<li>Actions: various uses that do not require the collection of information, including displaying multiple messages in a row.</li>
</ul>
<p>Here is an example of a story at this moment:</p>
<p><em>Basic self-assessment flow followed by a question</em></p>
<p><a target="_blank" href="https://www.nuecho.com/wp-content/uploads//2020/09/3.bmp" rel="noopener"><img decoding="async" class="wp-image-6895 size-full aligncenter" src="https://www.nuecho.com/wp-content/uploads//2020/09/3.bmp" alt="" width="425" height="493" /></a></p>
<h3><span style="font-family: 'Roboto Slab', Georgia, 'Times New Roman', serif; font-size: x-large;">Scene 5: Daily Check-In; Another type of Assessment, a Known Path</span></h3>
<p><span style="font-size: x-large;">The purpose of the daily check-in is to contact the user (who previously enrolled) everyday to assess their symptoms and, among other things, evaluate the progression of these symptoms. An initial question allows to establish which of three situations applies to the user: they feel better than the day before, they feel worse, or there were no changes. Each situation had its own decision tree, and each one of these had variations depending on the symptoms of the day before. While some questions were asked in each flow, overall, there were not enough similarities to reuse significant portions of the dialogue. Therefore, with the experience we had in implementing self-assessment flows, we knew that the better way to implement the daily check-in flows would be through three separate forms.</span></p>
<h3><span style="font-family: 'Roboto Slab', Georgia, 'Times New Roman', serif; font-size: 29px;">Scene 5.5: Daily Check-In all the Way</span></h3>
<p><span style="font-size: x-large;">There was far more than the assessment to the daily check-in: an “invalid URL” flow (id in URL sent to the user to access the daily check-in does not exist), a one-click offer to opt-out before the assessment, another one, varying depending on that day’s symptoms, after the assessment, and a set of recommendations at the end. The invalid URL flow was added as stories because it merely directed to other features. The opt-out options were added as other forms since we collected information and had to call our database. The recommendations started as an action to remain separated as a different flow, called as a followup action when necessary in the daily-ci end keep-or-cancel form. Then we realized that <a target="_blank" href="https://rasa.com/docs/rasa/api/events/#force-a-followup-action" rel="noopener">followup actions</a> still had to appear in stories, and when we added the transitions to other functionalities, it made more sense to directly include the recommendations in the form instead.</span></p>
<h3><span style="font-family: 'Roboto Slab', Georgia, 'Times New Roman', serif; font-size: 34px;">In the Next Episode</span></h3>
<p><span style="font-size: x-large;">In this first installment, we described how we used stories and forms to implement the many variants of the self-assessment and daily check-in flows. While stories were appropriate at first to define simple decision trees with few branches, it quickly became obvious that they are not the best tool to implement complex decision trees, conditional branching or reusable subflows. We therefore had to create several forms that were embedded in stories, and rely on stories to manage higher-level flows.</span></p>
<p>In the next steps of the project, we built on the initial functionalities to add the following features:</p>
<p style="padding-left: 30px;">• We expanded and improved the Q&amp;A flows</p>
<p style="padding-left: 30px;">• We added the testing sites navigation</p>
<p style="padding-left: 30px;">• We added NLU support, first to portions of the flows, and ultimately everywhere</p>
<p>These additions brought new challenges in how we used Rasa, not only in defining and developing the dialogue but also in incorporating NLU and ensuring its performance and accuracy were adequate.</p>
<p>The next installment of this series will explore these topics.</p><p>The post <a href="https://www.nuecho.com/chatbot-rasa-artificial-intelligence-covid-19/">Chloe: the Evolution, or Building a Covid-19 Chatbot with Rasa</a> first appeared on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Rasa Summit, Chatbot Conference, etc.: An Overall Impression</title>
		<link>https://www.nuecho.com/rasa-summit-chatbot-conference-chatbots-voicebots-voice-assistants/#utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=rasa-summit-chatbot-conference-chatbots-voicebots-voice-assistants</link>
		
		<dc:creator><![CDATA[Karine Dery]]></dc:creator>
		<pubDate>Tue, 29 Oct 2019 18:00:05 +0000</pubDate>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Content]]></category>
		<category><![CDATA[Event]]></category>
		<guid isPermaLink="false">https://zux.zsm.mybluehost.me/?p=4334</guid>

					<description><![CDATA[<p>A couple of weeks ago, I had the opportunity to attend Bot Week in San Francisco. In addition to the main events &#8211; Rasa Summit and Chatbot Conference &#8211; I attended every event of the week to make the most of my stay in this innovative city, and oh! it was worth it. Not only [&#8230;]</p>
<p>The post <a href="https://www.nuecho.com/rasa-summit-chatbot-conference-chatbots-voicebots-voice-assistants/">Rasa Summit, Chatbot Conference, etc.: An Overall Impression</a> first appeared on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>A couple of weeks ago, I had the opportunity to attend <a target="_blank" href="https://www.chatbotconference.com/bot-week" rel="noopener">Bot Week</a> in San Francisco. In addition to the main events &#8211; <a target="_blank" href="https://rasa.com/summit/" rel="noopener">Rasa Summit</a> and <a target="_blank" href="https://www.chatbotconference.com/" rel="noopener">Chatbot Conference</a> &#8211; I attended every event of the week to make the most of my stay in this innovative city, and oh! it was worth it. Not only did I learn a lot but I also met many interesting and interested people, important actors of the bots (voice and chat) ecosystem, and heard about exciting use cases and technologies. This full immersion gave me a renewed perspective on what has been done in this area and what is left to explore, and I will try, through this blogpost, to give you a glimpse of what I learned.</p>
<p>&nbsp;</p>
<h1>The Vision</h1>
<p>Like an evening star, leading innovators on the path to the new era of chatbots and voicebots, is a Vision, a Vision slowly leaving sci-fi movies to enter our reality: the omnipotent personal virtual assistant (OPVA). Imagine having your own OPVA. Or let’s call it Jarvis, like Iron Man’s. Imagine having your own Jarvis (iron suit sold separately). Jarvis is with you everywhere; Jarvis is your own personal vocal Google search; Jarvis starts your coffee pot 10 minutes before you wake up; Jarvis even reschedules your dentist appointment behind your back, because you unknowingly booked your camping trip the same week. This is the <em>Vision</em>.</p>
<p>Multiple speakers talked about the Vision, and/or the path to it. This path is generally represented as 5 levels of AI assistants. For more information, you can read Rasa’s CEO Alex Weidauer’s take on <a target="_blank" href="https://blog.rasa.com/conversational-ai-your-guide-to-five-levels-of-ai-assistants-in-enterprise/" rel="noopener">the 5 levels from an enterprise point of view</a>, or for a summary, these equivalences with some of Jarvis’s skills:</p>
<ol>
<li>Notification Assistant: Does not support user input, only sends messages<br /> <strong>Jarvis</strong>: The external temperature is 1,000 °C; this might become dangerous for your suit.</li>
<li>FAQ Assistant: One-step interactions, answers generic questions:<br /> <strong>Tony</strong>: What’s iron’s melting point?<br /> <strong>J</strong>: 1,538 °C</li>
<li>Contextual Assistant: Answers contextual questions if context is explicitly given:<br /> <strong>T</strong>: Can you send a message to Pepper?<br /> <strong>J</strong>: Sure. What is the message?<br /> <strong>T</strong>: “I will be late for dinner due to some complications, love you.”<br /> <strong>J</strong>: Got it.</li>
<li>Personalized Assistant: Knows the user, their preferences, has, or appears to have, some form of understanding of the user’s world:<br /> <strong>T</strong>: Can you notify my wife I might be late due to some complications?<br /> <strong>J</strong>: Sure, I will let Pepper know you will not be with her for dinner as expected.</li>
<li>Autonomously Organized Assistants: Services are interconnected and user does not need to intervene:<br /> <strong>J</strong>: Your blood pressure is dropping. May I suggest you head to the nearest hospital?<br /> <strong>T</strong>: I’m okay, I just need to&#8230;<br /> <strong>T</strong>: <em>Faints</em><br /> <strong>J</strong>: Sir? I didn’t understand. <em>(pause)</em> Your vital signs indicate you might have lost consciousness, I will bring you to the hospital if you do not explicitly cancel.<br /> <strong>J</strong>:<em> Starts auto-pilot to the nearest hospital, notifies Pepper and also notifies the hospital of the incoming patient</em>.</li>
</ol>
<h1></h1>
<p>&nbsp;</p>
<h1>Current Jarvis or Where Are Bots Now?</h1>
<p><img decoding="async" class="wp-image-4339 size-full aligncenter" src="https://www.nuecho.com/wp-content/uploads/2019/10/bot.jpg" alt="" width="797" height="558" /></p>
<p>I remember, a couple of years ago, all these “Build a Bot in 10 Minutes” blogs and tutorials, and how every dialogue engine was sold as the easiest and fastest way to create a chatbot. Many were trying to sell their own cheap version of this fashionable new toy.</p>
<p>I was more than happy to find that no one sells this idea anymore. The ideal chatbot has shifted from quick-to-build to personalized, efficient and conversational, as attested by the <a target="_blank" href="https://www.businesswire.com/news/home/20190528005646/en/Bank-America%E2%80%99s-Erica%C2%AE-Completes-50-Million-Client" rel="noopener">hype around Erica</a> (which, as a Canadian, I had not really heard about before the conference). Bank of America’s (large) team spent months working on it, and is still tuning it and enriching its vocabulary and skills. Pretty far from one person building a chatbot while making a deposit&#8230; Not only is it accepted that a bot needs a significant amount of thought and work beforehand, but also that it needs attention afterwards, using analytics and new user data for continuous improvement. Thus, the market has evolved, and many new companies have emerged in the last couple of years, offering tools and expertise to facilitate this continuous work.</p>
<p>Here are those who stood out the most by their strong presence during the week:</p>
<ol>
<li>Design tools: <a target="_blank" href="https://botmock.com/" rel="noopener">BotMock</a> and <a target="_blank" href="https://botsociety.io/" rel="noopener">BotSociety</a></li>
<li>Area-specific building tools: <a href="https://smartloop.ai/">Smartloop</a> for leads and sales</li>
</ol>
<p>N.B.: For a more exhaustive list, refer to the agendas of the events.</p>
<p>Special mention &#8211; <a target="_blank" href="https://robocopy.io/" rel="noopener">Robocopy</a>: The emergence of conversational bots in the last few years gave birth to the <em>Conversation Designer</em> job title. Many of those who wear this title are former UI designers, copywriters or linguists, and until now, the related knowledge was scattered across bot design tool guidelines and blog posts. I think the arrival of Robocopy’s <a target="_blank" href="https://conversationalacademy.com/" rel="noopener">Conversational Academy</a> marks a milestone in this field; it is becoming an area of expertise in itself, better defined every day. I can’t judge the quality of their courses based solely on the fascinating talk by their co-founder, Hans Van Damm, but putting this knowledge together can only be a push in the right direction.</p>
<h3>On the Conversational Aspect</h3>
<p>But to create a bot, technology needs to support the design. According to Alex Weidauer, technology has made it possible to create efficient question-answering bots (level 2) for a few years now (still not a ten-minute job though: training the natural language understanding (NLU) model and handling exceptions seamlessly demands work), and now allows level 3 bots, i.e. contextual assistants/bots. The next step would be achieving level 4 (another special mention to <a target="_blank" href="https://aigo.ai/" rel="noopener">Aigo</a>, which seems to have accomplished it for the daily tasks of a home assistant).</p>
<h1>Upcoming Jarvis or What’s Coming Up Next?</h1>
<h3>RCS</h3>
<p>The first talk at Chatbot Conference was given by Sean Badge from Google, on Rich Communication Services (RCS), an overdue rich-content protocol that is slowly replacing SMS. It is a step towards integrated enterprise assistants, allowing them to connect with the user on one network without forcing users to install separate apps.</p>
<h3>5G and Edge Computing</h3>
<p>At <em>Mobile Monday’s Future of Voice and Smart Speakers</em>, discussions revolved around how cloud computing is slowing down assistants and preventing voicebot conversations from feeling natural because of network latency. Imagine talking to one of your friends on the phone, and each time you stop talking, there is a one-second silence before they answer normally. You would wonder if your friend was one of the first victims of a robot takeover. In the same way, when virtual assistants do this, it only reminds us that it is not a human on the line.</p>
<p>Edge computing, i.e. distributed computing close to where it is needed, is probably the solution to this annoying latency, and 5G, which connects more devices at higher speeds, brings it closer than it has ever been. Voicebots could eventually be more like that friend who starts talking before the end of your sentence because they can predict the last words. The polite version.</p>
<h3>The Rasa Experience</h3>
<p><img decoding="async" class="wp-image-4341  alignright" src="https://www.nuecho.com/wp-content/uploads/2019/10/rasa.jpg" alt="" width="143" height="171" />As we are trying to make AI assistants more conversational and conversations more human-like, Rasa, as a dialogue engine, stands out as a promising technology for two reasons:</p>
<ol>
<li>The use of machine learning (ML) on the conversational level (and not only NLU)</li>
<li>Their open-source codebase</li>
</ol>
<p>We have been happily using Rasa for several months now, so the first advantage was already obvious to me: ML probably holds the key to machines acting like humans in a variety of contexts, since hard-coding every single reaction would be a colossal task, if not an impossible one. Consequently, as ML’s advocate in conversation management, Rasa has an edge its competitors do not. But it is only by attending the Rasa Summit that I could fully appreciate the advantages of the second point. A self-evident one is that open source means easy customization. It also means on-premise deployment, which is a plus for organizations managing sensitive user data, like banks, insurers or health care providers, three of the biggest owners of customer service chatbots (at least in the USA). And last but not least, a refreshing community feel emanates from Rasa events, because Rasa puts a significant emphasis on its community and values its contributors. This lets them retain people and enterprises, keep contributions coming, and bring in new ideas and technology, while aligning their product vision and roadmap with community requirements.</p>
<h3>About Voice</h3>
<p><img decoding="async" class="wp-image-4343 size-full aligncenter" src="https://www.nuecho.com/wp-content/uploads/2019/10/wave.jpg" alt="" width="789" height="191" /></p>
<p>Working for a company that has been bringing &#8220;conversational&#8221; and IVR together for years, I could not ignore how voice channels were discussed at these conferences. They did have a significant, but not central, place in Bot Week, which is only logical: how odd and inefficient would Jarvis be if only available by chat? The more conversational bots become, the less we can ignore that language starts with voice, and that, for this very reason, voice assistant usage keeps rising.</p>
<p>It is generally accepted that designing a voicebot is different from designing a chatbot because of the limited content that can be sent back. However, I noticed that bot developers, myself included, tend to forget something important, a fact expressed simply by Emily Lonetto from VoiceFlow at <em>Slack’s Building the Bots of the Future</em> event: voice might be the easiest, fastest and most portable channel to ask for things, but it is often not the right one to receive them. Indeed, for a single piece of information, you would expect Jarvis to answer verbally, but for a full report, you would expect a whole interactive 3D hologram (the equivalent of an email or a pile of paper from a real human assistant).</p>
<p>I think that this idea of a distinct output channel tends to be left behind for two reasons:</p>
<ol>
<li>In some voice channels or for some users who do not have the appropriate device (an Echo Show with Alexa for example), a visual output might be impossible.</li>
<li>The idea of designing one bot, with the same flow and the same NLU model for all channels, and only adapting the response, is tempting. While most bot-building platforms are designed with this workflow in mind, this over-simplification limits the possibility of sending an output on a second channel (see the sketch right after this list).</li>
</ol>
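<p>To make the &#8220;adapt the response&#8221; idea concrete, here is a minimal sketch, in Python, of how one dialogue result could be rendered per channel, with voice handing the rich content off to a second output channel instead of reading it aloud. The names (`BotResult`, `render`) are hypothetical and not any platform&#8217;s API:</p>
<pre><code># Illustrative only: adapt one dialogue result per channel, and let voice push
# rich content to a second output channel (e.g. e-mail) instead of reading it.
from dataclasses import dataclass

@dataclass
class BotResult:
    summary: str          # short answer, fine for any channel
    report_url: str = ""  # rich content that voice alone cannot convey well

def render(result: BotResult, channel: str) -> dict:
    """Adapt one dialogue result to the output channel (sketch, hypothetical names)."""
    if channel == "voice":
        prompt = result.summary
        if result.report_url:
            # Do not read the full report over the phone; hand it off instead.
            prompt += " I also sent the full report to your e-mail."
        return {"speech": prompt, "send_email": bool(result.report_url)}
    # A chat (or any visual) channel can embed the rich content directly.
    return {"text": result.summary, "link": result.report_url or None}

print(render(BotResult("Your balance is $42.", "https://example.com/report"), "voice"))
</code></pre>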
<p>Another cause of this simplification is probably that voice assistants’ speech-to-text (STT) algorithms are unaware of the NLU model. Surprisingly, no one mentioned the problem with this approach, which seems unavoidable to me. I will illustrate it with a true example that happened to me a few months ago while testing a bot over voice with such a system.</p>
<p>Context: I was testing a banking app and was asked if I wanted to make a recurring or a one-time payment; I answered “one time”. I could see the intermediate STT results of my audio stream, and here’s what I got:</p>
<ul>
<li style="text-align: left;"><em>One   </em>                    (I am not finished talking yet)</li>
<li style="text-align: left;"><em>One time </em>             (Cool it works)</li>
<li style="text-align: left;"><em>One time </em>             (It is waiting for me to say something else I guess&#8230;)</li>
<li style="text-align: left;"><em>Fun time</em>              (Final result. Wait what?)</li>
</ul>
<p>Obviously, my dialogue flow fell into an error state. The correct hypothesis was not chosen (and was even replaced!) because the speech recognition model was unaware of the kind of answer it should have been expecting. STT technology sure is getting better and better at eliminating noise, understanding accents and using the user’s history, their location or other contextual information, but user-specific information is not always available, e.g. in a phone call. Moreover, in this situation, the sound quality can be far below what a voice assistant can get, because of many factors (low bandwidth, low resolution, microphone quality, etc.), which multiplies the risk of an incorrect transcription.</p>
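<p>As a concrete illustration of giving the recognizer a hint about the answers the dialogue expects at a given turn, here is a minimal sketch using phrase hints (speech contexts), assuming the google-cloud-speech Python client; in a real application, the phrase list would be generated from the current dialogue state:</p>
<pre><code># Minimal sketch (assuming the google-cloud-speech Python client): bias the
# recognizer toward the answers the dialogue expects at this turn, so that
# "one time" is less likely to come back as "fun time".
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,  # typical telephony audio
    language_code="en-US",
    speech_contexts=[
        speech.SpeechContext(phrases=["one time", "one-time payment", "recurring"])
    ],
)

with open("caller_answer.raw", "rb") as f:  # raw 16-bit PCM from the call
    audio = speech.RecognitionAudio(content=f.read())

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
</code></pre>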
<p>Maybe in an innovative town like San Francisco, people do not talk about an “aging” medium like telephony, but we work with IVR systems every day, and know that large call centers are still a reality for many organizations, and will continue to be for years to come. With cell phones being so omnipresent, the phone remains the easiest means of communication for urgent situations such as calling the insurance company after a car crash.</p>
<p>It turns out that in this IVR universe, for the aforementioned reasons, technologies like VoiceXML did and still do close the gap between speech recognition and NLU. They should not be overlooked, as they can be used to bring the newer chatbot technology to legacy call center installations (as we did with Rasa and <a target="_blank" href="https://www.nuecho.com/news-events/developing-conversational-ivr-using-rasa-part-2-the-rivr-bridge/" rel="noopener">the Rivr Bridge</a>). Then one day, with technological advancements like Dialogflow’s <a target="_blank" href="https://cloud.google.com/dialogflow/docs/speech-adaptation" rel="noopener">Auto speech adaptation</a>, speech recognition, visual recognition, language understanding and conversation management will all work hand in hand, in constant communication, in Jarvis’s circuits, as they do in our own brains.</p><p>The post <a href="https://www.nuecho.com/rasa-summit-chatbot-conference-chatbots-voicebots-voice-assistants/">Rasa Summit, Chatbot Conference, etc.: An Overall Impression</a> first appeared on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p><p>The post <a href="https://www.nuecho.com/rasa-summit-chatbot-conference-chatbots-voicebots-voice-assistants/">Rasa Summit, Chatbot Conference, etc.: An Overall Impression</a> appeared first on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Developing Conversational IVR Using Rasa Part 2: The Rivr Bridge</title>
		<link>https://www.nuecho.com/developing-conversational-ivr-using-rasa-part-2-the-rivr-bridge/#utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=developing-conversational-ivr-using-rasa-part-2-the-rivr-bridge</link>
		
		<dc:creator><![CDATA[Karine Dery]]></dc:creator>
		<pubDate>Tue, 09 Jul 2019 13:00:31 +0000</pubDate>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[IVR]]></category>
		<guid isPermaLink="false">https://zux.zsm.mybluehost.me/?p=4118</guid>

					<description><![CDATA[<p>On the Rivr Bridge&#8230; I know, there’s only one question on your mind right now: “What in the world is that name?”. First off, “Rivr” is because it uses Rivr, subtly mentioned in the first post, which is a Nu Echo created, open-sourced framework to write VoiceXML applications, entirely in Java. Then “Bridge” because it [&#8230;]</p>
<p>The post <a href="https://www.nuecho.com/developing-conversational-ivr-using-rasa-part-2-the-rivr-bridge/">Developing Conversational IVR Using Rasa Part 2: The Rivr Bridge</a> first appeared on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
<p>The post <a href="https://www.nuecho.com/developing-conversational-ivr-using-rasa-part-2-the-rivr-bridge/">Developing Conversational IVR Using Rasa Part 2: The Rivr Bridge</a> appeared first on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
]]></description>
										<content:encoded><![CDATA[<h1>On the Rivr Bridge&#8230;</h1>
<p>I know, there’s only one question on your mind right now: “What in the world is that name?”. First off, “<a href="https://github.com/nuecho/rivr" target="_blank" rel="noopener">Rivr</a>” is because it uses Rivr, subtly mentioned in the first post, which is an open-source framework created by <a target="_blank" href="https://www.nuecho.com/" rel="noopener">Nu Echo</a> to write VoiceXML applications entirely in Java. Then “Bridge” because it links a VoiceXML platform with the chosen dialogue engine. And yes, the pun was intended (but not by me).</p>
<p>But the real question is: “What does it do?”. As I said, Rivr is a framework to develop full-fledged applications, but the Rivr Bridge’s goal is only to translate what comes in and out of the VoiceXML platform and hand it to the Rasa side of the world in a digestible format. For instance, a classic Rivr application would programmatically process each user input and define the next dialogue steps, whereas the Rivr Bridge queries the chosen dialogue engine to decide those steps. Adapting the model was simple, maybe even simpler than we thought. It roughly looks like this:</p>
<p><img decoding="async" class="wp-image-4128 size-full aligncenter" src="https://www.nuecho.com/wp-content/uploads/2019/07/1.jpg" alt="" width="1059" height="613" srcset="https://www.nuecho.com/wp-content/uploads/2019/07/1.jpg 1059w, https://www.nuecho.com/wp-content/uploads/2019/07/1-300x174.jpg 300w, https://www.nuecho.com/wp-content/uploads/2019/07/1-768x445.jpg 768w, https://www.nuecho.com/wp-content/uploads/2019/07/1-1024x593.jpg 1024w" sizes="(max-width: 1059px) 100vw, 1059px" /></p>
<p>The great advantage of using the Rivr Bridge is that it interprets the VoiceXML platform’s input and generates bulletproof VoiceXML. For reusability purposes, we decided to make the Bridge platform-agnostic and application-agnostic, and let an IVR channel on the Rasa side manage the Rasa-specific aspects, which would allow us to eventually plug in other dialogue engines.</p>
<p>Here is an artistic representation of our input pipeline:</p>
<p><img decoding="async" class="aligncenter wp-image-4130 size-full" src="https://www.nuecho.com/wp-content/uploads/2019/07/2.jpg" alt="" width="1046" height="386" srcset="https://www.nuecho.com/wp-content/uploads/2019/07/2.jpg 1046w, https://www.nuecho.com/wp-content/uploads/2019/07/2-300x111.jpg 300w, https://www.nuecho.com/wp-content/uploads/2019/07/2-768x283.jpg 768w, https://www.nuecho.com/wp-content/uploads/2019/07/2-1024x378.jpg 1024w" sizes="(max-width: 1046px) 100vw, 1046px" /></p>
<h1>… Through an IVR JSON Protocol&#8230;</h1>
<p>To better define the content of the requests and responses exchanged by the Rivr Bridge and the IVR channel, we designed a generic JSON protocol that could represent all the necessary information for a conversational IVR application using VoiceXML. The protocol describes 5 types of input:</p>
<ul>
<li>data (initialization data, for example: the caller’s phone number or any information the platform is set to return)</li>
<li>recognition/interpretation result of a user input (vocal or using the keypad)</li>
<li>recording (of the user’s voice)</li>
<li>transfer details (status, duration, etc.)</li>
<li>event (hangup, noinput, nomatch&#8230;)</li>
</ul>
<p>Concerning outputs, we only designed support for interaction (the dialogue asks for a user input) and exit/hangup, to cover our use cases.</p>
<p>As an example, to ask a question and wait for the answer, the dialogue could send this payload:</p>
<p><img decoding="async" class="wp-image-4162 size-full aligncenter" src="https://www.nuecho.com/wp-content/uploads/2019/07/Webp.net-resizeimage-1.jpg" alt="" width="811" height="896" srcset="https://www.nuecho.com/wp-content/uploads/2019/07/Webp.net-resizeimage-1.jpg 811w, https://www.nuecho.com/wp-content/uploads/2019/07/Webp.net-resizeimage-1-272x300.jpg 272w, https://www.nuecho.com/wp-content/uploads/2019/07/Webp.net-resizeimage-1-768x848.jpg 768w" sizes="(max-width: 811px) 100vw, 811px" /></p>
<p>And the result sent by the Bridge could be:</p>
<p><img decoding="async" class="aligncenter wp-image-4155 size-full" src="https://www.nuecho.com/wp-content/uploads/2019/07/Capture.jpg" alt="" width="811" height="652" srcset="https://www.nuecho.com/wp-content/uploads/2019/07/Capture.jpg 811w, https://www.nuecho.com/wp-content/uploads/2019/07/Capture-300x241.jpg 300w, https://www.nuecho.com/wp-content/uploads/2019/07/Capture-768x617.jpg 768w" sizes="(max-width: 811px) 100vw, 811px" /></p>
<h1>… To the IVR Channel</h1>
<p>Not a lot was then left for the IVR channel to do. Concerning inputs, each one would need some processing to be made accessible to the dialogue management. Specifically, inputs have to fit into Rasa’s NLU result format (namely, a string following the template: `intent@confidenceScore{“entityType”: entityValue, &#8230;}`). With well written <a href="https://en.wikipedia.org/wiki/Speech_Recognition_Grammar_Specification" target="_blank" rel="noopener">grammars</a>, this step’s implementation was rather simple for recognition results, but could have been tricky for input types with neither intent nor entities (data, events), for which we still wanted to trigger a dialogue turn. To solve that problem, we could either create synthetic intents and entities representing the information we wanted to pass on, or insert it directly into the <a href="http://rasa.com/docs/rasa/user-guide/architecture/">tracker</a> and send a semantically empty input. We went for the first option, and have created four synthetic intents to date (a small sketch of the formatting step follows the list):</p>
<ul>
<li>start_conversation (with a data entity containing initialization data as a JSON object)</li>
<li>noinput</li>
<li>nomatch</li>
<li>hangup</li>
</ul>
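<p>Here is that formatting step as a small sketch (the `to_rasa_message` helper is illustrative, not our actual code): it turns a recognition result, or one of the synthetic intents above, into the `intent@confidenceScore{&#8230;}` string the dialogue expects:</p>
<pre><code>import json
from typing import Optional

def to_rasa_message(intent: str, confidence: float = 1.0, entities: Optional[dict] = None) -> str:
    """Format an intent, its confidence and its entities as an NLU result string (sketch)."""
    entity_part = json.dumps(entities) if entities else ""
    return f"{intent}@{confidence}{entity_part}"

# A recognition result coming from a VoiceXML grammar:
print(to_rasa_message("make_payment", 0.92, {"payment_type": "one_time"}))
# make_payment@0.92{"payment_type": "one_time"}

# A platform event with no intent and no entities, mapped to a synthetic intent:
print(to_rasa_message("noinput"))
# noinput@1.0

# Initialization data passed through the start_conversation synthetic intent:
print(to_rasa_message("start_conversation", 1.0, {"data": {"caller": "+15551234567"}}))
</code></pre>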
<p>For the outputs, yet again some formatting was necessary, but since Rasa gives us full control over the output content through <a href="http://rasa.com/docs/rasa/core/domains/#custom-output-payloads">custom payloads</a>, this was pretty straightforward. The (tiny bit more) delicate work was to concatenate and validate outputs coming from different parts of the dialogue, as sketched below. Rivr supports playing messages alone (without a recognition or hangup step), and it could be a nice feature for our Rasa dialogues, but it would have required a bit more gymnastics in both the channel and the Bridge, so we chose not to implement it for now.</p>
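<p>As for that merging, here is a rough sketch of what the IVR channel does with the custom payloads of a single turn (field names are hypothetical): collect every prompt and validate that the turn ends with exactly one recognition step, since standalone messages are not supported yet:</p>
<pre><code>def merge_outputs(custom_payloads: list[dict]) -> dict:
    """Concatenate the prompts of a turn and keep its single recognition step (sketch)."""
    prompts: list[str] = []
    recognition = None
    for payload in custom_payloads:
        prompts.extend(payload.get("prompts", []))
        if "recognition" in payload:
            if recognition is not None:
                raise ValueError("A turn may only contain one recognition step")
            recognition = payload["recognition"]
    if recognition is None:
        raise ValueError("Standalone messages are not supported: a turn must end "
                         "with a recognition (or hangup) step")
    return {"type": "interaction", "prompts": prompts, "recognition": recognition}

print(merge_outputs([
    {"prompts": ["Thanks."]},
    {"prompts": ["Recurring or one-time payment?"], "recognition": {"grammar": "payment_type.grxml"}},
]))
</code></pre>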
<p>OK, presented like that, maybe the IVR channel still had a lot to do, even with the Rivr Bridge. But it was still less work than generating VoiceXML content would have been. Thanks, Rivr! To follow the journey of those user inputs once they enter the Rasa ocean, read the yet-to-come rest of the series!</p><p>The post <a href="https://www.nuecho.com/developing-conversational-ivr-using-rasa-part-2-the-rivr-bridge/">Developing Conversational IVR Using Rasa Part 2: The Rivr Bridge</a> first appeared on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p><p>The post <a href="https://www.nuecho.com/developing-conversational-ivr-using-rasa-part-2-the-rivr-bridge/">Developing Conversational IVR Using Rasa Part 2: The Rivr Bridge</a> appeared first on <a href="https://www.nuecho.com">AI Virtual Voice Experts with Google Dialogflow CX - CCAI - Nu Echo</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
