Microsoft researchers have introduced a generative AI model that combines a single photo of a person with an existing audio sample to create a realistic animated talking avatar – a talking head, essentially. The technology opens myriad possibilities for everything from video games to live communications for education and health care, while also raising concerns about even more lifelike deepfakes that could be used for cyberattacks or disinformation.

The tech vendor’s Microsoft Research group introduced the VASA-1 AI model in a research paper this month, and Microsoft also launched a dedicated project page that includes a number of demonstrations. Both describe an AI model that not only creates avatars whose mouths are synced to the audio clip, but also produces realistic emotions and motions.

“Given a static face image of an arbitrary individual, alongside a speech audio clip from any person, our approach is capable of generating a hyper-realistic talking face video efficiently,” the researchers wrote in their 15-page paper. “This video not only features lip movements that are meticulously synchronized with the audio input but also exhibits a wide range of natural, human-like facial dynamics and head movements.”

Similar Projects in the Works

Microsoft is far from the only company working to create more realistic visual imagery that combines human or human-like images with audio clips. Google Research last month introduced Vlogger, which synthesizes humans from audio samples, while OpenAI’s Sora can create videos from text prompts. Smaller vendors like Pika Labs and Runway are also working on similar models.

However, the Microsoft researchers argue that existing models don’t reach the level of authenticity that VASA-1 does.

“Current research has predominantly focused on the precision of lip synchronization with promising accuracy obtained,” they wrote. “The creation of expressive facial dynamics and the subtle nuances of lifelike facial behavior remain largely neglected. This results in generated faces that seem rigid and unconvincing.”

They added that “natural head movements also play a vital role in enhancing the perception of realism. Although recent studies have attempted to simulate realistic head motions, there remains a sizable gap between the generated animations and the genuine human movement patterns.”

Disentanglement and Diffusion

According to the researchers, their work takes the static image of a person’s head and an audio clip of someone talking or singing – it doesn’t clone or simulate voices – and applies machine learning techniques. A disentanglement process supplies the high level of expressiveness, separating everything from facial expressions and features to the position of the head so that each can move independently of the others.

“The core innovations include a holistic facial dynamics and head movement generation model that works in a face latent space, and the development of such an expressive and disentangled face latent space using videos,” Microsoft says on its project page.

In addition, the diffusion model takes in optional signals as conditions, from the direction of eye gaze – such as forward, right, left, up, and down – to head distance to emotion offsets, delivering more realism, the company wrote. Its method of “out-of-distribution generation … exhibits the capability to handle photo and audio inputs that are out of the training distribution. For example, it can handle artistic photos, singing audios and non-English speech. These types of data were not present in the training set.”
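To make the idea of optional conditioning signals concrete, the sketch below bundles the controls the paper describes – gaze direction, head distance, and an emotion offset – into a single conditioning input. The names and structure here are illustrative assumptions for a diffusion-style sampler, not Microsoft’s actual API.

```python
# Hypothetical sketch of VASA-1-style optional conditioning signals.
# All names below are assumptions for illustration, not the real interface.
from dataclasses import dataclass, asdict


@dataclass
class ControlSignals:
    gaze_direction: str = "forward"  # e.g. "forward", "left", "right", "up", "down"
    head_distance: float = 1.0       # relative head-to-camera distance
    emotion_offset: float = 0.0      # shifts the generated expression along an emotion axis


def build_condition(signals: ControlSignals) -> dict:
    """Flatten the optional signals into a conditioning dict a sampler could consume."""
    return asdict(signals)


# Defaults are used for anything the caller doesn't specify, mirroring the
# "optional" nature of the signals described in the paper.
cond = build_condition(ControlSignals(gaze_direction="left", head_distance=0.8))
print(cond)
```

Each signal is independent of the others, which matches the disentangled control the researchers describe: changing the gaze condition should not alter head distance or emotion.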

The model can create 512×512-pixel images at 45 frames per second (fps) in an offline batching mode and up to 40fps in an online streaming mode, with a latency of 170 milliseconds. This can be done on a desktop PC using a single Nvidia RTX 4090 GPU. The model was trained on the VoxCeleb2 dataset, which contains more than a million utterances by 6,112 celebrities, captured from YouTube.
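A quick back-of-the-envelope check puts those streaming numbers in context: at 40fps, each frame has a 25-millisecond budget, so the reported 170-millisecond latency amounts to roughly seven frames of lead time before playback.

```python
# Sanity-check the reported streaming figures: frame budget at 40 fps
# and how many frames the 170 ms latency corresponds to.
fps = 40
latency_ms = 170

frame_budget_ms = 1000 / fps                        # 25.0 ms per frame
frames_of_latency = latency_ms / frame_budget_ms    # 6.8 frames

print(f"{frame_budget_ms} ms per frame, ~{frames_of_latency:.1f} frames of latency")
```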

The Good and the Bad

Microsoft and its researchers stressed the positive impacts that VASA-1 can have, from delivering enhanced educational capabilities to more children to assisting people with communication challenges. Companies could use such avatars to improve communications with customers or to provide therapeutic support.

That said, the company also acknowledges the ways the technology can be abused by people intent on creating deepfakes that spread disinformation – a particular threat at times of high-profile elections, for example – fuel cyberattack campaigns, or create content harmful to individuals and organizations.

CISA, the FBI, and the National Security Agency last fall issued an advisory about the dangers posed by AI- and machine learning-generated deepfakes, writing that “the most substantial threats from the abuse of synthetic media include techniques that threaten an organization’s brand, impersonate leaders and financial officers and use fraudulent communications to enable access to an organization’s networks, communications and sensitive information.”

The agencies outlined ways to identify and respond to the threats from deepfakes.

“Threats from synthetic media, such as deepfakes, have exponentially increased – presenting a growing challenge for users of modern technology and communications, including the National Security Systems (NSS), the Department of Defense (DoD), the Defense Industrial Base (DIB) and national critical infrastructure owners and operators,” CISA wrote.

Microsoft wrote that it opposed the use of such technologies for nefarious purposes, adding that it is “interested in applying our technique for advancing forgery detection. Currently, the videos generated by this method still contain identifiable artifacts, and the numerical analysis shows that there’s still a gap to achieve the authenticity of real videos.”

In addition, the company said that given the potential for the model to be abused, it has no plans to release an online demo, API, additional implementation details, or any products or related offerings “until we are certain that the technology will be used responsibly and in accordance with proper regulations.”