In this post, we'll focus on what conversational AI is, how it works, and what platforms exist to enable data scientists and machine learning engineers to implement this technology. So, if you are interested in building a conversational AI bot, this article is for you.
What is conversational AI and how is it different from traditional chatbots?
Conversational AI is the area of artificial intelligence that deals with letting people interact with software services through chat- or voice-enabled conversational interfaces, and with making that interaction feel natural. In general, it is a set of technologies that work together to help chatbots and voice assistants process human language, understand intents, and formulate appropriate, timely responses in a human-like manner.
Simply put, it is an AI you can talk to via text or voice.
Text and voice-enabled conversational bots
For example, imagine you are in your comfy bed ready to fall asleep but the lights are still on. Instead of getting out of bed and switching off the lights manually, you can open the Alexa app on your phone and give a voice command like, “Alexa, bedroom lights off.”
Or you want to find out the opening hours of a clinic, check if you have symptoms of a certain disease, or make an appointment with a doctor. So, you go on the clinic’s website and have a textual conversation with a bot instead of calling on the phone and waiting for a human assistant to answer.
This brings us to the question of how conversational AI is different from rule-based chatbots.
Traditional or rule-based chatbots are software programs that rely on a series of predefined rules to mimic human conversation or perform other tasks through text messaging. They follow a rigid conditional formula — if X (condition) then Y (action). Such chatbots may use simpler or more complex rules, but they can’t answer questions outside of the defined scenario.
A common bot response to an unforeseen scenario
Also, such chatbots do not learn from interactions with a user: they only follow the previously defined scenarios you wrote for them.
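To make the rigid "if X (condition) then Y (action)" formula concrete, here is a minimal sketch of a rule-based bot. The keywords, answers, and function name are illustrative assumptions, not any real product's logic; note how any input outside the predefined rules falls through to a canned fallback.

```python
# A toy rule-based chatbot: each rule is a keyword condition mapped to a
# fixed answer. Anything outside the script triggers the fallback reply.

RULES = {
    "hours": "We are open Monday through Friday, 9 a.m. to 5 p.m.",
    "appointment": "Sure, let's book an appointment. What day works for you?",
    "address": "We are located at 123 Main Street.",
}

def rule_based_reply(message: str) -> str:
    text = message.lower()
    for keyword, response in RULES.items():
        if keyword in text:  # the rigid condition: keyword present -> fixed answer
            return response
    # Unforeseen scenario: the bot cannot go beyond its script
    return "Sorry, I didn't understand that. Could you rephrase?"
```

Calling `rule_based_reply("What are your hours?")` matches a rule, while `rule_based_reply("Do you take walk-ins?")` hits the fallback, illustrating why such bots can't answer questions outside the defined scenario.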
An AI-based chatbot is also software programmed to mimic human conversation. But unlike rule-based systems, these chatbots can improve over time through data and machine learning algorithms. Basically, AI chatbots can do two core things that differentiate them, namely:
- generalize — a bot can understand patterns and add context to a conversation, going beyond the defined script; and
- get better from human feedback — when a user provides additional information and corrects a bot’s mistakes, you can use those corrections to automate learning for the model to improve.
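The "get better from human feedback" idea can be sketched as follows. This is a deliberately simplified, hypothetical example: the class, phrases, and intent labels are made up, and a real system would retrain an ML model rather than store exact phrases.

```python
# Toy illustration of learning from corrections: when a user corrects the
# bot's guess, the corrected phrase is stored as a new training example,
# so the same input is handled correctly on the next interaction.

from collections import defaultdict

class TrainableIntentMatcher:
    def __init__(self):
        self.examples = defaultdict(set)  # intent -> set of known phrases

    def add_example(self, phrase: str, intent: str) -> None:
        self.examples[intent].add(phrase.lower())

    def predict(self, phrase: str) -> str:
        phrase = phrase.lower()
        for intent, phrases in self.examples.items():
            if phrase in phrases:
                return intent
        return "unknown"

    def correct(self, phrase: str, right_intent: str) -> None:
        # Human feedback: store the correction as new training data
        self.add_example(phrase, right_intent)

matcher = TrainableIntentMatcher()
matcher.add_example("reset my password", "password_reset")
matcher.predict("i'm locked out of my email account")  # unknown at first
matcher.correct("i'm locked out of my email account", "password_reset")
matcher.predict("i'm locked out of my email account")  # now recognized
```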
Let’s dive deeper into core technologies that enable machines to do these awesome things.
Conversational AI key concepts and technologies
Human language is incredibly complex, and so is the task of building machines that can understand and use it. For example, there are many possible ways for a user to say that they need their password reset: from simple "Reset my password" and "I forgot my password" to more complex "I'm locked out of my email account" and "I can't sign into my account."
Our brains are wired to be good at understanding all of that, but computers are not. That’s why conversational AI systems need some help in the form of smart technologies to execute communication in a human-like manner. Or at least approach it.
High-level architecture of conversational AI
As you can see from the image above, there are a lot of pieces of the tech puzzle involved. So, it’s worth reviewing the key concepts before we dive into how conversational AI works.
Automatic speech recognition (ASR) or speech-to-text is the conversion of speech audio waves into a textual representation of words. ASR analyzes audio data and parses sound into language tokens that the system can then process and convert into text. If the conversation is text-only, the system skips this step.
Machine learning (ML) is a field of knowledge that deals with creating various algorithms to train computer systems on data and enable them to make accurate predictions on new data inputs. ML emphasizes adjustments, retraining, and updating of algorithms based on previous experiences.
Deep learning (DL) is a specific approach within machine learning that utilizes neural networks to make predictions based on large amounts of data. Neural nets are a set of algorithms in which the input data goes through multiple processing layers of artificial neurons piled up on top of one another to provide the output. Deep learning enables computers to perform more complex functions like understanding human speech.
Natural language processing or NLP is the heart and soul of any conversational AI system. It is a branch of Artificial Intelligence that enables machines to process and understand human speech. NLP relies on linguistics, statistics, and machine learning to recognize human speech and text inputs and transform them into a machine-readable format.
There are two subfields in natural language processing:
- Natural language understanding (NLU), as the name suggests, is about understanding human language and recognizing the underlying intent. It uses syntactic and semantic analysis of text and speech to extract the meaning of what’s said or written.
- Natural language generation (NLG) is the process of creating a human language text response based on some data input.
With these concepts in mind, let’s look under the hood of a typical conversational AI architecture to see how everything works.
How conversational AI works
While conversational AI systems may be built differently, the architecture commonly comprises a few core elements that breathe life into what we know as intelligent assistants.
An example of a conversational AI pipeline
These components are:
- a user interface — a text or voice interface visible to users so that they can interact with an AI assistant;
- an input analyzer module — an NLU component that extracts meaning from a user’s natural language input and passes it further;
- a dialogue management module (dialogue engine) — an ML-powered module that manages dialogue state and coordinates building the assistant’s response to the user;
- an integrations module (optional) — an orchestrator that allows you to connect to third-party data via APIs; and
- an output module — a component that uses natural language generation to create a response.
User interface
Everything starts with a user’s input, also known as an utterance, which is literally what the user says or types. In our case, this is the textual sentence, “What will the weather be like tomorrow in New York?” The utterance goes to the user interface of a conversational platform. This is where you can rely on your preferred messaging or voice platform, e.g., Facebook Messenger, Slack, Google Assistant, or even your own custom bot.
Input analyzer module
So, when the message is picked up by the system, it is sent to the input analyzer module where the NLU engine resides. The main idea here is to take the unstructured utterance in plain human language and extract structured data that an assistant can then use to understand what the input is about.
Utterance split into intent and entities
The user’s utterance is generally understood to contain two main components.
Intent is what the user wants to achieve. The intent is generally centered on a key verb, but this verb may not be explicitly included in the utterance. In our example, the intent is “request weather.”
Entities are parameters or details that are important to an assistant to be more specific on what actions it can take. Entities can be names, dates, times, company names, etc. Here, there are two entities — time (tomorrow) and location (New York).
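A toy parser makes the intent/entity split concrete. This is a hedged sketch with hand-written keyword rules and made-up intent and entity labels; a real NLU engine uses trained models rather than string matching.

```python
# Minimal sketch: split an utterance into an intent and entities using
# simple keyword rules (illustrative only, not a real NLU engine).

def parse_utterance(utterance: str) -> dict:
    text = utterance.lower()
    intent = "request_weather" if "weather" in text else "unknown"
    entities = {}
    for time_word in ("today", "tomorrow"):
        if time_word in text:
            entities["time"] = time_word          # time entity
    for city in ("new york", "london"):
        if city in text:
            entities["location"] = city           # location entity
    return {"intent": intent, "entities": entities}

parse_utterance("What will the weather be like tomorrow in New York?")
# -> {'intent': 'request_weather', 'entities': {'time': 'tomorrow', 'location': 'new york'}}
```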
To apply structure to the unstructured text and extract intents and entities, the NLU component performs two tasks: intent classification and entity recognition.
How intent classification works
Intent classification. It finds meaning that applies to the entire utterance. It commonly includes the following stages:
- tokenization — the process of defining the borders of units within the text, such as the beginning and end of words;
- word vectorization — the process of mapping words or phrases to corresponding vectors of numbers (generally speaking, converting words into numbers) to create a sentence representation; and
- intent classification — the process of predicting the label for the user input based on vectors using an ML model, for example, a supervised machine learning model called Support Vector Machine (SVM).
Entity recognition. It finds meaning in individual words and phrases within the utterance. It commonly includes the following stages:
- Part of Speech tagging — the process of assigning a grammatical category to each word (whether it is a noun, a verb, an adjective, etc.);
- named entity recognition — the process of chunking the sentence and detecting the named entities such as person names, location names, company names, etc. from the text; and
- entity extraction — the process of mining the value and the label of the entity.
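The intent-classification stages above can be sketched end to end with a toy bag-of-words model. The nearest-centroid classifier here is a deliberately simple stand-in for the SVM or neural model a production NLU engine would use, and all training phrases and labels are illustrative assumptions.

```python
# Tokenization -> word vectorization -> intent classification, in miniature.

def tokenize(text: str) -> list[str]:
    # Stage 1: tokenization - define the borders of units (words)
    return text.lower().replace("?", "").replace("!", "").split()

def vectorize(tokens: list[str], vocab: list[str]) -> list[int]:
    # Stage 2: word vectorization - map words to a vector of counts
    return [tokens.count(word) for word in vocab]

def classify(vector: list[int], centroids: dict) -> str:
    # Stage 3: intent classification - pick the closest labeled example
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: distance(vector, centroids[label]))

TRAINING = {
    "request_weather": "what is the weather like",
    "reset_password": "reset my password please",
}
VOCAB = sorted({w for text in TRAINING.values() for w in tokenize(text)})
CENTROIDS = {label: vectorize(tokenize(text), VOCAB) for label, text in TRAINING.items()}

classify(vectorize(tokenize("What will the weather be tomorrow?"), VOCAB), CENTROIDS)
# -> "request_weather"
```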
Dialogue management module
The extracted details, including intent and entities, are then passed to the dialogue management module. It also uses the power of machine learning: this can be a simple algorithm like a decision tree or a more complex model like a deep neural network.
As a showcase, we’ll take recurrent neural networks (RNNs), which are often used in developing chatbots and text-to-speech technologies.
RNNs are a type of neural net with looped connections, meaning the output of a certain neuron is fed back as an input. These nets can consider sequential data and understand the context of a whole piece of text, making them a good match for building chatbots. Apart from intent and entity input, RNNs can be fed with corrected outputs and third-party information.
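The looped connection can be made concrete with a single recurrent step. This is a tiny hand-rolled sketch with made-up scalar weights, not a trained model: the point is only that the hidden state produced for one input is fed back when processing the next, so the final state depends on the whole sequence.

```python
# One recurrent step: the new hidden state mixes the current input x
# with the previous hidden state h (the "looped connection").

import math

def rnn_step(x: float, h: float, w_x: float = 0.5, w_h: float = 0.8) -> float:
    return math.tanh(w_x * x + w_h * h)

h = 0.0
for x in [1.0, 0.0, 1.0]:   # a toy input sequence
    h = rnn_step(x, h)
# h now carries context from the whole sequence, not just the last input
```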
So, this module of conversational AI takes the details about the input from the NLU model, considers the history of the conversation, and uses all this information to predict what an assistant should do – how to respond at the specific stage of the conversation.
Integrations module
The architecture may optionally include integrations and connectors to the backend systems and databases. This is an orchestrator module that may call an API exposed by third-party services. In our example, this can be a weather forecasting service that will give relevant information about the weather in New York for a particular day.
Output module
Once an assistant has formed a response, it sends the details to the output module. The response has two parts: an action (performed by the orchestrator) and a text response. Providing an answer based on the recognized intent is the task of natural language generation. It interprets the received data, creates a narrative structure, combines words and parts of sentences, and applies grammatical rules. The dialogue engine finally responds to the user via the user interface with an answer, “It’ll be cloudy and 71℉.”
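The simplest NLG strategy is template filling, which is enough to produce the weather answer in our example. The sketch below assumes a structured forecast result (the keys and values are hypothetical); a real system would receive them from the integrations module.

```python
# Template-based natural language generation: render the dialogue
# engine's structured result into a human-readable sentence.

def generate_response(forecast: dict) -> str:
    return "It'll be {conditions} and {temperature}\u2109.".format(**forecast)

generate_response({"conditions": "cloudy", "temperature": 71})
# -> "It'll be cloudy and 71℉."
```

More sophisticated NLG builds sentences word by word with a language model, but templates remain common in production assistants because they guarantee grammatical, on-brand output.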
The number of dialogue rounds can be infinite. With each round, conversational AI gets better at predicting user intents and providing more accurate and relevant responses.
It’s worth noting that depending on the system, you may see slightly different terminology than the one we used in our example.
Conversational AI examples across industries
Conversational AI systems have a lot of use cases in various fields since their primary goal is to facilitate communication and support of customers.
Conversational AI in healthcare
The healthcare industry can greatly benefit from using conversational AI as it helps patients understand their health problems and quickly directs them to the right medical professionals. It can also reduce the load on call centers and eliminate call drop-offs.
Clara — Coronavirus self-checker bot by US CDC
When Covid-19 hit, the global population started searching for information about the disease and its symptoms. Emergency hotlines were flooded with phone calls, so plenty of people were left without any help. The US Centers for Disease Control decided to create its own conversational AI assistant called Clara to let everyone find out more about coronavirus, check their symptoms, and see whether they should be tested for COVID-19. The bot greatly helped increase people's awareness of the disease and what should be done if a person thinks they have signs of the condition.
Conversational AI in travel
Conversational AI is the future of highly efficient traveling. Customers can communicate with chatbots to find inspiration on where to go on a vacation, complete hotel and airline bookings, and pay for it all.
Let’s take, for example, virtual host Edward from Edwardian Hotels London. This cool AI chatbot uses text messaging to provide hotel guests with personalized information and assistance. He can find the nearest vegetarian restaurant if you wish or point you to where the towels are in your room.
And there are a lot of other types of chatbots designed specifically for the travel and hospitality domain. You can learn more on the topic in our dedicated article explaining how to build a bot that travelers will love.
Conversational AI in banking
In 2018, Bank of America introduced its AI-powered virtual financial assistant named Erica.
Chatbot displaying account balance and offering advice. Source: ChatbotGuide.org
The smart banking bot helps customers with simple processes like viewing account statements, paying bills, receiving credit history updates, and seeking financial advice. During the third quarter of 2019, digital clients of Bank of America logged into their accounts 2 million times and made 138 million bill payments. By the year’s end, Erica was reported to have handled 19.5 million interactions and achieved 90% efficiency in answering users’ questions.
Keep in mind that these are just three examples out of thousands. In reality, conversational AI applications can be found in every domain.
How to approach conversational AI: Platforms to consider
There are quite a few conversational AI platforms to help you bring your project to life. We’ll take a look at the most popular ones.
Please note: The off-the-shelf solutions we’re reviewing in this section are for a general understanding of the market, not for promotional purposes.
IBM Watson Assistant
Pioneering the domain, IBM offers an AI platform called Watson Assistant that enables developers and business users to collaborate and build conversational solutions. It is feature-rich and integrates with various existing content sources and applications. IBM claims it is possible to create and launch a highly intelligent virtual agent in an hour without writing code.
IBM is known for being a leading conversational AI platform for huge enterprises. For example, Humana — one of the largest insurance providers in the US — fields thousands of routine health insurance queries per day with a voice-enabled, self-service conversational AI developed with Watson. This Voice Agent serves callers at one-third of the cost of Humana’s previous system and doubles the call containment rate.
Rasa
Rasa is an open-source framework for creating conversational AI. The framework consists of two Python libraries: Rasa NLU and Rasa Core. The former enables an assistant to understand what a user means, recognizing the intent and deriving the details of user input. The latter, Rasa Core, is an ML-based dialogue development and management system. Rasa allows you to build and run conversational AI locally, use your own data, and tune models to your specific use cases.
Google Dialogflow
Google also has a wide array of software services and prebuilt integrations in its catalog. Google’s Dialogflow is the primary service used for conversational AI.
At its core, Dialogflow is an NLU platform that helps design and integrate a conversational user interface into your mobile app, web application, device, bot, interactive voice response system, and so on. Dialogflow is set to analyze various user input types and provide responses through text or with synthetic speech.
The platform comes in two shapes, each of which has its own agent type, user interface, API, client libraries, and documentation:
- Dialogflow CX is for building large and complex virtual agents; and
- Dialogflow ES is suitable for small and simple agents.
Amazon Lex
Amazon Lex is a fully managed artificial intelligence (AI) service with advanced NLU models for designing, building, testing, and deploying conversational interfaces in applications. It integrates easily with other AWS cloud-based services as well as external interfaces. Amazon Lex uses the same foundation as Alexa, which is available on more than 100 million devices worldwide and configurable with more than 100,000 Alexa skills. These skills are usually command interpreters, but you can use Lex to build full conversational AIs as well.
Based on the use case, it may be more sensible to build your own custom conversational AI system without relying on any of the existing solutions. Though more difficult to implement, this approach is a good way to ensure that the end result meets all of your desired criteria.