Pindrop, the tech company that traced the sources of high-profile AI-generated voice-cloning incidents targeting President Biden and Elon Musk, has now identified the technology used in a more recent case involving Vice President Kamala Harris.
Using its Pindrop Pulse technology, researchers at the 13-year-old company said the unknown individuals behind the video, which made the rounds on social media, used TorToise, an open-source text-to-speech (TTS) system available on the code and project repositories GitHub and Hugging Face as well as in voice-cloning frameworks.
“It’s possible that a commercial vendor could be reusing TorToise in their system,” Rahul Sood, Pindrop’s chief product officer, wrote in a blog post Friday. “It’s also possible that a user employed the open source version.”
The case illustrates the rising concerns about voice cloning and other AI-generated deepfakes as the U.S. presidential election, now fewer than 100 days away, draws near, Sood wrote. It also shows how other methods for smoking out deepfakes, including watermarking and consent systems, may not be enough to detect misleading images and voices, or to find their sources.
This is important as Americans continue to get more of their news from social media, which also has become an avenue used by politicians to get their messages out. Sood pointed to a Pew Research Center survey that showed that half of U.S. adults are getting their news at least partially from social media like Facebook, YouTube, X (formerly Twitter), TikTok and Instagram.
“When we consume our news on social media, we may assume that the information we’re seeing is honest and credible,” he wrote. “Yet, as a recent parody that uses AI-generated voice cloning of VP Kamala Harris implies, we can’t always believe what we’re hearing.”
The Harris Deepfake
On July 26, Musk reposted a video on X that appeared to be an ad from Harris’ campaign. The video came from an X account, Mr.ReaganUSA. In a follow-up video on the account, a man said he “may have singlehandedly ended Kamala Harris’ presidential campaign, with a little help from Elon Musk.”
He also admitted that “the controversy is partially fueled by my use of AI to generate Kamala’s voice.”
“Our research was able to determine more precisely that the audio is a partial deepfake, with AI-generated speech intended to replicate VP Harris’s vocal likeness alongside audio clips from previous remarks by the VP,” Pindrop’s Sood wrote.
Musk’s post was still live as of Wednesday and had 133 million views, 245,000 reposts and 936,000 likes. The same day, the Mr.ReaganUSA X account posted another parody of Harris.
Pulse uses Pindrop’s liveness detection technology to detect unique patterns – like frequency changes and spectral distortions – that differ from natural speech, then uses AI to analyze those patterns, creating a “fakeprint” that captures the artifacts separating machine-generated from genuine human speech. It runs a continuous assessment, producing a segment-by-segment analysis every four seconds to look for synthetic audio.
Pindrop researchers reduced the background noise, like music in the video, and then used Pulse to detect what Sood wrote were 15 four-second segments of synthetic voice and six four-second segments that weren’t synthetic, indicating it was at least a partial deepfake.
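Pindrop hasn’t published how Pulse scores segments internally, but the segment-by-segment logic described above can be sketched in plain Python. Everything here – the function names, the 0.5 threshold and the example scores – is a hypothetical illustration, not Pindrop’s implementation.

```python
# Hypothetical sketch of segment-level deepfake scoring (NOT Pindrop's code).
# Assumes an upstream model has already produced one "synthetic likelihood"
# score for each 4-second audio segment.

SEGMENT_SECONDS = 4  # Pulse reportedly assesses audio in 4-second segments


def classify_segments(scores, threshold=0.5):
    """Label each 4-second segment as synthetic or genuine."""
    return ["synthetic" if s >= threshold else "genuine" for s in scores]


def verdict(labels):
    """Summarize the per-segment labels into an overall call."""
    synthetic = labels.count("synthetic")
    genuine = labels.count("genuine")
    if synthetic and genuine:
        call = "partial deepfake"
    elif synthetic:
        call = "full deepfake"
    else:
        call = "no synthetic audio detected"
    return {"synthetic_segments": synthetic,
            "genuine_segments": genuine,
            "call": call}


# Toy input: 15 high-scoring and 6 low-scoring segments, mirroring the
# counts Pindrop reported for the Harris video.
scores = [0.9] * 15 + [0.1] * 6
print(verdict(classify_segments(scores)))
```

With a mix of synthetic and genuine segments, the sketch lands on the same conclusion Pindrop reported: a partial deepfake.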
The researchers then found that the open source TorToise tool was used to create the video.
Challenges of Watermarking
“This incident demonstrates the challenges with watermarking to identify deepfakes and their sources, an issue Pindrop has raised previously,” he wrote. “While several of the top commercial vendors are adopting watermarking, numerous open-source AI systems have not adopted watermarking. Several of these systems have been developed outside the US, making enforcement difficult.”
Among those commercial vendors are Google’s DeepMind AI unit with its SynthID technology, Meta and OpenAI.
Sood also noted that some commercial vendors are considering other tools, like consent systems, to address the abuse of voice cloning, but these are difficult to enforce with open-source AI systems. There also are no consistent standards or third-party tools to validate them.
Finding the Signature
Pulse detects what Pindrop calls the “signature” of the AI generating system.
“Every voice cloning system leaves a unique trace, including the type of input (‘text’ vs ‘voice’), the ‘acoustic model’ used, and the ‘vocoder’ used,” Sood wrote. “Pulse analyzes and maps these unique traces against 350+ AI systems to determine the provenance of the audio.”
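Pindrop hasn’t disclosed how Pulse maps traces to its library of 350+ AI systems, but the general idea of attributing audio to a known generator can be illustrated with a simple nearest-neighbor lookup over feature vectors. The signature names, vectors and cosine-similarity metric below are invented stand-ins for illustration only.

```python
import math

# Hypothetical attribution sketch (NOT Pindrop's method): match an unknown
# audio's acoustic "trace" against stored signatures of known voice-cloning
# systems. The vectors are made-up stand-ins for real acoustic-model and
# vocoder features.
KNOWN_SIGNATURES = {
    "TorToise":   [0.82, 0.10, 0.55],
    "ElevenLabs": [0.30, 0.75, 0.20],
    "OtherTTS":   [0.05, 0.40, 0.90],
}


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def attribute(trace):
    """Return the known system whose signature best matches the trace."""
    best = max(KNOWN_SIGNATURES,
               key=lambda name: cosine_similarity(trace, KNOWN_SIGNATURES[name]))
    return best, cosine_similarity(trace, KNOWN_SIGNATURES[best])


unknown_trace = [0.80, 0.12, 0.50]  # toy extracted features
system, score = attribute(unknown_trace)
print(system, round(score, 3))
```

In practice a provenance system would compare far richer features – input type, acoustic model and vocoder artifacts, as Sood describes – but the matching step reduces to the same find-the-closest-signature idea.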
Pindrop has used the same approach in other incidents, including the robocalls in New Hampshire in January, ahead of the state’s presidential primary, that used a deepfake voice of President Biden to discourage Democratic voters from voting. A TTS system from ElevenLabs was used in that case.
ElevenLabs’ systems also were used to create a deepfake posted on YouTube that was made to look like a live stream of Musk but turned out to be a six-plus-minute AI-generated loop mimicking his voice as it discussed U.S. politics, elections and their impact on the future of cryptocurrency.
Pindrop released Pulse in February and two months later said the technology was 98.23% accurate in detecting synthetic content created using OpenAI’s Voice Engine. Earlier this month, Pindrop secured $100 million in debt financing from Hercules Capital, which executives said would be used to further develop its technology for companies in such industries as finance, contact centers, insurance and health care.