The future of large language models will increasingly involve audio, and a Shanghai Innovation Institute-led project is keeping open source in the loop with MOSS-Audio, one of a series of audio-focused AI projects from the lab. 

Just transcribing speech into text is old school. There is no reason why models can’t perceive acoustic cues, recognize emotions, interpret sounds, and identify speakers, all through complex multi-step inference and the ability to reason over temporal context.

MOSS-Audio is one of an emerging generation of models that take in all the additional ambient information that floats around recorded conversations and spoken word presentations. 

MOSS-Audio understands not only speech but also user intent, environmental sounds and even music. It can also perform complex reasoning over the source audio.

MOSS is not an acronym: The researchers borrowed the name of a fictional advanced quantum computer from the Chinese film franchise The Wandering Earth. Also, MOSS is pronounced “MOSI.”

“In Chinese, ‘MOSI’ can be understood as ‘model thinking.’ This reflects our broader vision of building AI systems that can reason, interact, and assist with complex tasks,” wrote Xipeng Qiu, one of the project’s researchers, in an email to Techstrong.ai.

OpenMOSS at Work

OpenMOSS offers two pairs of models. One set (“8B”) is for full-tilt precision and the other (“4B”) is geared for running on restricted resources. Each set comes in two optimizations, one tuned for following instructions and the other for stronger reasoning.

The model recognizes spoken content, assigning timestamps at both the word and sentence level. It recognizes the tone and timbre of the speaker and, given the context, can estimate the speaker’s emotional state.
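For readers curious how word-level and sentence-level timestamps relate, here is a minimal Python sketch. It is an illustration only, not MOSS-Audio's actual output format or API: the `Word` class, the `sentence_spans` function and the sample data are all hypothetical, and sentences are split naively on closing punctuation.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical word-level timestamp record (times in seconds)."""
    text: str
    start: float
    end: float


def sentence_spans(words):
    """Group word-level timestamps into (text, start, end) sentence spans.

    Assumption: a sentence ends at any word whose text ends in '.', '?' or '!'.
    """
    spans, current = [], []
    for word in words:
        current.append(word)
        if word.text.endswith((".", "?", "!")):
            text = " ".join(w.text for w in current)
            spans.append((text, current[0].start, current[-1].end))
            current = []
    if current:  # trailing words with no closing punctuation
        text = " ".join(w.text for w in current)
        spans.append((text, current[0].start, current[-1].end))
    return spans


# Made-up sample data, not real model output.
words = [
    Word("Hello", 0.0, 0.4),
    Word("there.", 0.5, 0.9),
    Word("How", 1.2, 1.4),
    Word("are", 1.4, 1.6),
    Word("you?", 1.6, 2.0),
]

for text, start, end in sentence_spans(words):
    print(f"[{start:.1f}s - {end:.1f}s] {text}")
```

Each sentence span simply inherits the start time of its first word and the end time of its last, which is why word-level timing is the more fundamental of the two.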

The model extracts cues from background noise, which can be used to infer more context around the speaker. With music, it can understand musical style, instrumentation and other acoustic features (the lab plans to spin this musical capability off into its own release, MOSS-Music).

The LLM can also add metadata to what is being recorded. It can create summaries and answer questions about the recordings.

Audio Enlarges the Context

Having thoroughly mined text-based sources, frontier AI modeling is looking to audio to provide more context. AI transcription services totally miss aspects like sarcasm and subtext (or “shade” in human parlance). The words themselves are only one aspect of human speech.

AudioLLMs “present a striking paradox: models capable of complex reasoning often fail at elementary auditory perception. While they excel on reasoning-heavy benchmarks, they struggle to reliably identify speaker traits, emotion, prosody, or even simple non-linguistic acoustic events,” wrote researchers at Tencent and Peking University in one arXiv paper.

MOSS-Audio is one of a number of similar efforts to expand the audio frontier.

There is also Moshi. Developed by the French Kyutai AI lab, Moshi is a full-duplex audio framework with a 7-billion-parameter temporal transformer. A demo on the lab’s website lets users talk with the model. Chinese cloud giant Alibaba Cloud’s Qwen3-Omni is a multimodal model that can interpret and respond equally to audio, text or video. DeSTA2.5-Audio and UALM-Reason are two other efforts in this space.

All this auditory context will not only help with interpreting the sounds around us but also improve how AI communicates with the outside world, especially through the controversial practice of voice cloning. OpenAI, for instance, recently acquired Weights.gg, a voice cloning start-up.

A sister project to MOSS-Audio, MOSS-TTS is the Shanghai Innovation Institute’s voice-cloning technology. A promotional video on the GitHub site argues that vocal delivery is inseparable from the story being told, and that the speaker’s cadence and tonal quality add depth to the material being presented.

Perhaps audio metadata could also bring more nuance to LLM-generated writing, which now comes with all the monotony of a speech translator.