Deepgram today revealed it has developed a more advanced artificial intelligence (AI) model, dubbed Nova-3, that enables speech-to-text (STT) transcription in near real time.

Based on a latent space architecture that compresses data representations so that only the essential features of the input's underlying structure are preserved, Nova-3 can encode complex speech patterns into a highly efficient representation while delivering higher levels of accuracy.
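For a concrete feel for the idea, here is a toy illustration in Python of latent compression. It uses plain PCA, not anything resembling Deepgram's actual model, and all the dimensions and data are made up; it only shows how a high-dimensional feature matrix can be squeezed into a small latent code while retaining most of its structure:

```python
import numpy as np

# Illustrative sketch only, not Deepgram's architecture. It shows the core
# latent-space idea: project high-dimensional input features into a much
# smaller space that keeps most of the structure (variance), then reconstruct.
rng = np.random.default_rng(0)

# Fake "speech feature" matrix: 1000 frames x 80 spectral features,
# generated from 8 underlying factors plus noise.
factors = rng.normal(size=(1000, 8))
mixing = rng.normal(size=(8, 80))
frames = factors @ mixing + 0.1 * rng.normal(size=(1000, 80))

# PCA via SVD: the top-8 right singular vectors are the "latent" axes.
centered = frames - frames.mean(axis=0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)
latent = centered @ vt[:8].T     # 80-dim frames -> 8-dim latent codes
reconstructed = latent @ vt[:8]  # decode back to 80 dimensions

err = np.linalg.norm(centered - reconstructed) / np.linalg.norm(centered)
print(f"kept {8/80:.0%} of dimensions, reconstruction error {err:.1%}")
```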

That approach enables the Nova-3 model to, for example, accurately transcribe speech in environments such as restaurants, where there tends to be a lot of background noise to filter out, says Deepgram CEO Scott Stephenson.

The Nova-3 model enables real-time transcription across multiple languages, including in applications that require domain-specific terminology, such as emergency response services. Nova-3 is also the first voice AI model to enable self-serve customization, allowing users to fine-tune the model for specialized domains without deep expertise in machine learning, says Stephenson.

Keyterm Prompting, for example, makes it possible to improve transcription accuracy by supplying up to 100 domain-specific key phrases at request time, without having to retrain the underlying model, he notes.
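In practice, that customization happens per request. The sketch below shows what a Keyterm Prompting call could look like against Deepgram's hosted /v1/listen endpoint, assuming the keyterm query parameter Deepgram documents for Nova-3; the API key, audio URL and key phrases are placeholders:

```python
import requests

DEEPGRAM_API_KEY = "YOUR_API_KEY"  # placeholder credential

# Transcribe a hosted audio file with Nova-3, boosting domain-specific
# key terms at request time; no model retraining is involved.
params = {
    "model": "nova-3",
    # Repeatable keyterm parameter (up to 100 per request, per the article).
    "keyterm": ["myocardial infarction", "epinephrine", "defibrillator"],
}
resp = requests.post(
    "https://api.deepgram.com/v1/listen",
    params=params,
    headers={"Authorization": f"Token {DEEPGRAM_API_KEY}"},
    json={"url": "https://example.com/dispatch-call.wav"},  # hypothetical audio
)
resp.raise_for_status()
print(resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"])
```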

Those capabilities enable Nova-3 to outperform other models in both batch and streaming use cases, with consistently lower word error rates (WER), says Stephenson. The company claims Nova-3 achieves a WER of 5.26%, a 47.4% relative reduction compared to the next-best competitor's 10% WER.
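The 47.4% figure is a relative reduction, not a percentage-point gap; the arithmetic behind the claim is straightforward:

```python
# Relative WER reduction implied by the company's numbers.
nova3_wer = 5.26      # claimed Nova-3 word error rate (%)
next_best_wer = 10.0  # next-best competitor's WER (%)
relative_reduction = (next_best_wer - nova3_wer) / next_best_wer
print(f"{relative_reduction:.1%}")  # -> 47.4%
```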

At the same time, organizations can apply policies and controls to redact sensitive information in real time to ensure compliance with data privacy mandates.
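Deepgram exposes redaction as a request option, so masking can happen in the same pass as transcription. A minimal sketch, assuming the documented redact parameter and entity types (the credential and audio URL are again placeholders):

```python
import requests

DEEPGRAM_API_KEY = "YOUR_API_KEY"  # placeholder credential

# Request redaction alongside transcription so sensitive values never
# reach downstream systems in plain text. The entity types below
# (pci, ssn, numbers) come from Deepgram's docs; confirm what your plan supports.
params = {
    "model": "nova-3",
    "redact": ["pci", "ssn", "numbers"],
}
resp = requests.post(
    "https://api.deepgram.com/v1/listen",
    params=params,
    headers={"Authorization": f"Token {DEEPGRAM_API_KEY}"},
    json={"url": "https://example.com/support-call.wav"},  # hypothetical audio
)
resp.raise_for_status()
# Redacted segments come back masked in the returned transcript.
print(resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"])
```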

That level of performance and accuracy will be especially critical when developing AI agents destined to be incorporated into a wide range of real-time application environments, says Stephenson. Many of those AI agents will need to provide a level of interactive engagement that requires low-latency translations, he adds. “Agents are going to be connected to a brand identity,” says Stephenson.

Deepgram provides organizations with access to a range of AI models for voice applications that, depending on latency and cost requirements, address different classes of use cases. Those voice AI models are already being invoked by more than 200,000 application developers, who have transcribed more than one trillion words using the platform, notes Stephenson.

There are, of course, plenty of options when it comes to accessing AI models, but when cost, performance, data sovereignty and compliance issues are considered, the Deepgram platform provides the most flexibility across multiple use cases, Stephenson says. In contrast, the voice AI models made available by hyperscalers are far more limited in terms of the number of use cases they practically enable, he adds.

Regardless of approach, it’s already clear that voice-enabled AI applications will increasingly become the default user experience, offering a level of natural interactivity that text prompts typed into a chat interface will never match. The challenge now is providing that kind of experience at a cost the average organization can afford to deploy widely.
