GenAI

The Federal Trade Commission (FTC) in November kicked off a project to come up with ways to protect the public from fraud and other scams by bad actors using AI voice-cloning technologies and is now asking for public input.

The agency this week opened a 10-day window for organizations and individuals to submit ideas that address policies, products or procedures that can prevent, monitor and evaluate the malicious use of voice-cloning technologies, from robocalls to fraudsters using such AI tools to impersonate family or friends to lure people into giving them money or information in so-called “grandparent” scams.

Such schemes aren’t new, but the rapid development of text-to-speech (TTS) generative AI tools is making them easier for cybercriminals to launch.

The FTC opened its Voice Cloning Challenge to submissions on the same day that scientists from MIT, Tsinghua University and AI startup MyShell announced OpenVoice, an open source voice-cloning platform that can almost instantly clone a voice from a 30-second clip. The platform also gives users greater control over a range of elements, from accents to emotions to intonations, and works in multiple languages.

“Today, we proudly open source our OpenVoice algorithm, embracing our core ethos – AI for all,” MyShell announced on X (formerly Twitter).

Development, With Security

The two announcements illustrate the ongoing push and pull between companies big and small that are rushing ahead with innovations in an already accelerating AI market and government agencies and some inside the industry trying to put guardrails in place to reduce the harm that the technology may have on society.

That includes AI voice tools, which can help in areas ranging from assisting people with speech problems to breaking down language barriers. As noted, they also can be used in various frauds. The FTC in March 2023 issued a warning about the growing sophistication of grandparent scams thanks to AI.

The challenges will only grow. The global AI voice generator space will reach almost $5 billion by 2032, with the voice cloning market climbing to almost $1.8 billion by 2029.

Protecting the Public

Through its Voice Cloning Challenge, the FTC is hoping to pull in ideas from organizations and individuals from around the country to keep voice-cloning tools like OpenVoice from being used to target consumers in fraud schemes.

“This effort may help push forward ideas to mitigate risks upstream – shielding consumers, creative professionals, and small businesses against the harms of voice cloning before the harm reaches a consumer,” the FTC wrote on the project’s website. “It also may help advance ideas to mitigate risks at the consumer level.”

At the same time, if no viable ideas come out of the challenge, “this will send a critical and early warning to policymakers that they should consider stricter limits on the use of this technology, given the challenge in preventing harmful development of applications in the marketplace,” the agency wrote, adding that technology alone cannot address the issues of AI and that policymakers cannot rely on the industry policing itself.

Those interested in submitting ideas for the FTC’s Voice Cloning Challenge have until January 12, with the winners being announced soon after. Details of how to submit ideas can be found on the project’s website.

Pushing the Envelope

OpenVoice is the latest AI voice-cloning platform to roll out in recent years, joining others from such vendors as ElevenLabs, Resemble AI, Speechify, and Synthesys AI Studio. In a seven-page research paper released in conjunction with the OpenVoice announcement, the researchers behind the open source platform argued that many of the offerings on the market can clone the “tone color” – or the timbre that makes a speaker’s voice recognizable – but “they do not allow users to flexibly manipulate other important style parameters such as emotion, accent, rhythm, pauses and intonation.”

“These features are crucial for generating in-context natural speech and conversations, rather than monotonously narrating the input text,” they wrote.

The researchers pulled together two AI models that they then used together. The first is a “base speaker TTS” model that handles specifications within the speech, including emotions, pauses and articulation. It also manages the languages the voice speaks in. They trained the model on 30,000 sentences that ran an average of 7 seconds, and included audio samples from people speaking in English – with American and British accents – Japanese, and Chinese.

The second model is the “tone color converter,” a convolutional neural network that was trained on 300,000 audio samples from 20,000 people, with about 180,000 of the samples in English and 60,000 each in Chinese and Japanese.

The voice created by the base speaker TTS model is passed to the tone color converter, which not only reproduces the speaker’s voice but also enables the user to manipulate the tone of the voice, they wrote.

“The intuition behind the approach is that it is relatively easy to train a base speaker TTS model to control the voice styles and languages, as long as we do not require the model to have the ability to clone the tone color of the reference speaker,” they wrote. “Therefore, we proposed to decouple the tone color cloning from the remaining voice styles and the language, which we believe is the foundational design principle of OpenVoice.”
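The decoupling the researchers describe can be pictured with a small toy sketch. It is not the OpenVoice code or API: the class names, the `clone_voice` function and the `Speech` record are hypothetical stand-ins, and plain strings take the place of real audio. The only point is to show that the first stage decides what is said and how, while the second stage changes nothing but the tone color.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Speech:
    """Toy stand-in for an audio clip (hypothetical, not an OpenVoice type)."""
    text: str
    language: str
    style: str       # e.g. emotion, accent, rhythm
    tone_color: str  # the speaker's timbre


class BaseSpeakerTTS:
    """Stage 1: controls style and language, always in one fixed base voice."""
    BASE_TONE_COLOR = "base-speaker"

    def synthesize(self, text: str, language: str, style: str) -> Speech:
        return Speech(text, language, style, self.BASE_TONE_COLOR)


class ToneColorConverter:
    """Stage 2: swaps only the tone color and leaves everything else as is."""

    def convert(self, speech: Speech, reference_tone_color: str) -> Speech:
        # replace() returns a copy with just the named field changed.
        return replace(speech, tone_color=reference_tone_color)


def clone_voice(text: str, language: str, style: str,
                reference_tone_color: str) -> Speech:
    """Chain the two stages: base speech first, then tone color conversion."""
    base = BaseSpeakerTTS().synthesize(text, language, style)
    return ToneColorConverter().convert(base, reference_tone_color)


if __name__ == "__main__":
    print(clone_voice("Hello there", "en", "cheerful", "reference-speaker"))
```

Because the first stage never has to imitate anyone, style and language can be changed there without touching the cloning step, which is the design principle the paper highlights.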

They added that one driver for making the source code and model weights of the platform publicly available is to facilitate future research.