
The conversational AI space, with its various intelligent virtual assistants and chatbots, is booming, with the market expected to expand more than 23% each year through 2030 and almost 80% of CEOs wanting to improve their companies’ customer relationship strategy by investing in conversational AI.

A key driver in the growing space is rising demand for multimodal capabilities: the ability to interact with humans in multiple ways, such as text and voice, rather than through a single channel.

In a recent report, Cem Dilmegani, principal analyst with market research firm AIMultiple, wrote that in 2021, there were more than 120 million voice assistants in the United States, a number that will continue to rise.

“We predict that as the numbers increase, voice searches will become more prevalent in online transactions,” he wrote. “Such a paradigm will force ecommerce companies to design multimodal conversational AI tools which can respond to both text and voice.”

IT vendors are moving in that direction. The research firm noted that Meta and Google both are working on digital assistant efforts that use multimodal AI.

Simultaneous Modalities

Executives at the company say it is well ahead of competitors in this area: its EVA (Enterprise Virtual Assistant) platform can not only take in information from a human, but also respond with a range of modalities – including text, voice, facial expressions and tone of voice – and do so at the same time.

A key to this capability is an approach the company calls “Temporal Behavior Analysis of Multi-Modal Conversations in a Question and Answer System,” which enables virtual agents to interact in a more human-like fashion – something the company says not only improves the customer experience but also gives enterprises deeper insights about those customers. The company recently was issued a patent for the technology, which is already in the EVA platform and can also be deployed at the edge in Apple iOS and Android mobile apps.

“A true multi-modal system can take input in any form, or any combination, and use the sum of that multi-modal input to decide how to respond,” CEO Raj Tumuluri said. “A lot of systems claim multi-modality even when they do one modality at a time, [such as] speech-only or touch-only or type-only. What we do differently from most is the use of simultaneous multimodality.”

For example, in EVA’s case, the platform’s avatar can simultaneously show empathy in its facial expression, intonation and hand gestures “automatically based on the conversation and mental state of the user,” he said.

Both Input and Output

That applies to both input and output, Tumuluri said. On the input side, it means using multiple modalities at once to convey information to the virtual agent, such as pointing to an item while asking how much it costs, or telling the agent you want two of one item but none of another.

“This implies that the system should have the capability to interpret the various sensory inputs the user provides by fusing them together,” he said. “For output, multimodality has two components. The first is an awareness of what modalities are available to be used and the second is what modality – or modalities – are the best to convey the output.”
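The two halves Tumuluri describes – fusing sensory inputs on the way in, and choosing from the available modalities on the way out – can be illustrated with a minimal sketch. This is not the company's patented method; it is a generic late-fusion example, with all names and the time window chosen here for illustration:

```python
from dataclasses import dataclass

@dataclass
class ModalInput:
    """One sensory event from the user, e.g. a speech fragment or a touch."""
    modality: str    # "speech", "touch", "vision", ...
    payload: dict
    timestamp: float  # seconds

def fuse_inputs(events, window=1.5):
    """Group events arriving within `window` seconds of the first event in a
    group into one fused turn, so co-occurring speech and touch are
    interpreted together rather than one modality at a time."""
    groups, current, start = [], [], None
    for e in sorted(events, key=lambda e: e.timestamp):
        if start is None or e.timestamp - start <= window:
            current.append(e)
            start = e.timestamp if start is None else start
        else:
            groups.append(current)
            current, start = [e], e.timestamp
    if current:
        groups.append(current)
    return groups

def choose_output_modalities(available, message_kind):
    """Pick output channels by intersecting a preference order for the
    message type with what the client device reports as available."""
    preferred = {
        "confirmation": ["voice", "text"],
        "empathy": ["avatar_expression", "voice_intonation", "text"],
        "list": ["text", "voice"],
    }
    return [m for m in preferred.get(message_kind, ["text"]) if m in available]
```

A speech event and a touch half a second later land in the same fused group; a text-only client asking for an empathetic response gets text, since no avatar channel is available.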

Such capabilities will be increasingly important as intelligent virtual agents, chatbots and other natural language processing systems expand and mature, according to the CEO.

“A single modality system using text or voice sequentially will never be able to fully comprehend a human intent absent other modalities of expression,” Tumuluri said. “Intonation and facial and bodily expressions are vital to human understanding and comprehension. Using multi-modality, embodied virtual assistants can come close to the human level of understanding and expression of empathy and engagement.”

Real-Life Uses

He pointed to several situations where such multimodal capabilities are useful, such as reporting an insurance claim: the customer can photograph the damage and annotate it while explaining what happened, rather than having to describe in words where on the windshield the crack occurred and how.

“The same goes for pointing out the transactions on your credit card billing statement that you don’t recognize by simply touching and saying, ‘This, this and this are not made by me,’ and the system will automatically retrieve the information from all those transactions and prepare a list of unrecognized transactions by itself,” Tumuluri said.

The company makes the technology available to iOS and Android users via application containers that run an on-device version of the EVA platform, which continues to function even when the device is offline.
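The billing-statement example hinges on resolving each spoken “this” to whatever the user touched at that moment. A minimal illustration of that idea – not the company's implementation, just a simple ordering heuristic with hypothetical field names – might look like:

```python
def resolve_deictic_references(utterance, touch_events, transactions):
    """Align spoken deictic words ("this") with touch events.

    Heuristic: the k-th "this" in the utterance refers to the k-th item
    touched, in timestamp order. Returns the matching transactions.
    """
    n_refs = utterance.lower().count("this")
    touched_ids = [t["target_id"]
                   for t in sorted(touch_events, key=lambda t: t["timestamp"])]
    disputed = set(touched_ids[:n_refs])
    return [tx for tx in transactions if tx["id"] in disputed]
```

With three touches and the utterance “This, this and this are not made by me,” the function returns the three touched transactions, which the system could then compile into the disputed-charges list.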

“For large models, we will automatically load low footprint trained versions of our models when offline, and then seamlessly transition from that to cloud based on available bandwidth,” he said. “This has applications in health care (not all places in a hospital allow Wi-Fi devices), retail (coverage at remote warehouses can be challenging), manufacturing (interference from machinery) and defense (real-time translation of speech in both directions needed with no internet connection).”
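The offline fallback Tumuluri describes – a low-footprint on-device model when connectivity is poor, the cloud model otherwise – amounts to a routing decision on measured bandwidth. A hedged sketch, with the model names and threshold invented here for illustration:

```python
def select_model(bandwidth_mbps,
                 offline_model="eva-on-device",   # hypothetical name
                 cloud_model="eva-cloud"):        # hypothetical name
    """Route to the on-device model when offline or below a bandwidth
    threshold; otherwise use the cloud model. The 1.0 Mbps cutoff is an
    assumed value, not a documented one."""
    THRESHOLD_MBPS = 1.0
    if bandwidth_mbps is None or bandwidth_mbps < THRESHOLD_MBPS:
        return offline_model
    return cloud_model
```

In practice such a router would also have to hand over session state between tiers so the transition is seamless to the user, per the scenarios (hospitals, warehouses, factory floors, field deployments) cited above.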