Generative AI may soon be at a loss for words, according to a new study from the research group Epoch AI. That prospect is fueling a "literal gold rush" for human-generated verbiage, one seemingly manifested in Meta's high-profile move to scrape Instagram and public Facebook posts beginning June 26, despite European opposition.

Epoch says AI tech companies will exhaust the supply of publicly available training data for AI language models as early as next year. Once that training data is drained, the AI industry may find it difficult to maintain its current phenomenal growth rate. And while synthetic data presents itself as a possible solution, there is a great deal of concern that "digital inbreeding" and increased hallucinations would degrade AI performance.

Generative AI is like a beast that needs continuous feeding. Increasing the size of large language models (LLMs) is crucial for efficiently improving the performance of the models that power generative AI, notes Epoch. The scale already is massive: the largest databases of human-generated public text, such as RefinedWeb, C4, and RedPajama, contain tens of trillions of words collected from billions of web pages.
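To give a sense of that scale, here is a minimal sketch of sampling one of these public corpora without downloading it in full. It assumes the Hugging Face datasets library and the allenai/c4 dataset id, neither of which is mentioned in the article; it is purely illustrative.

```python
# Minimal sketch: stream a public text corpus (C4) and estimate words per document.
# Assumes the Hugging Face `datasets` library and the `allenai/c4` dataset id;
# neither is named in the article -- this is illustrative only.
from itertools import islice

from datasets import load_dataset

# Streaming avoids downloading the multi-terabyte corpus up front.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Sample the first 10,000 documents and compute a rough words-per-document figure.
sample_words = sum(len(doc["text"].split()) for doc in islice(c4, 10_000))
print(f"~{sample_words / 10_000:.0f} words per document in the sample")
```

Extrapolating a per-document average across billions of pages is how corpora like these reach the tens of trillions of words Epoch describes.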

But that’s not enough. Epoch argues that human-generated public text data cannot sustain scaling indefinitely, with a peak coming as soon as 2026 and no later than 2032 (perhaps earlier if “frontier” models are over-trained). The indexed web is vast, but it pales in size next to the “deep web,” whose largest components are closed content platforms like Facebook and Instagram, both owned by Meta. (Elon Musk’s X platform is another.) Parts of these platforms are indexed, but the vast majority is not.

On June 26, Meta will begin using Instagram and public Facebook posts as training data for its AI tools. Meta emphasizes it will use only public posts for AI training, not private posts or messages, and posts from people under the age of 18 also won’t be used. Meta will draw on public posts going back as far as 2007. Only users in Illinois and in Europe are able to opt out, thanks to their AI and privacy protection regulations, and they will be offered an exemption from the training of Llama 3, Meta’s next big LLM. Meta is keen to use European sources, however, and positions itself as a quasi-public-service savior, rescuing Europeans from being denied access to an AI the rest of the world has.

“If we don’t train our models on the public content that Europeans share on our services and others, such as public posts or comments, then models and the AI features they power won’t accurately understand important regional languages, cultures, or trending topics on social media,” said Stefano Fretta, global engagement director for Meta, in a company post. “We believe Europeans will be ill-served by AI models that are not informed by Europe’s rich cultural, social and historical contributions.”

Meta is experiencing some pushback in Europe. A Vienna-based advocacy group called NOYB (None of Your Business) has filed nearly a dozen complaints, alleging that years’ worth of private photos, posts and online tracking data can be shared with unidentified third parties. That data also can’t be easily scrubbed from LLMs. Meta’s Facebook and Instagram scraping plans appear to have been largely shaped by EU regulatory considerations, and Meta is already being probed in the EU over misinformation and child safety issues.

Alarm also is spreading among artists, photographers and other creatives who have publicly posted their work over the years on Instagram to gain visibility. Creatives say that as their work becomes fodder for AI, they are losing their livelihoods; some artists who Googled their own work found AI-generated images mimicking their style. Many of these creatives are now moving to Cara, a portfolio app for artists that bans AI posts and training. Meta said in May that it considers public Instagram posts part of its training data, prompting an exodus of online creators; Cara says its app grew from 40,000 to 650,000 users in a week. This comes on top of numerous existing lawsuits alleging that AI companies have scraped intellectual property well beyond the bounds of the fair use doctrine to develop generative AI. Meanwhile, DuckDuckGo says it has created a way to use several AI chatbots without exposing users to AI training.

For its part, Epoch says that techniques like transfer learning, in which a model trained on one task is adapted to another, and the use of synthetic data offer the best way around a bottleneck in public human text data. Beyond the privacy concerns, Epoch says the quality of social media content is probably lower and more fragmented than that of web content, making it more difficult to use.
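The appeal of transfer learning is that it reuses what a model has already learned, so far less new text is needed. Here is a minimal sketch, assuming PyTorch and the Hugging Face transformers library with the distilbert-base-uncased checkpoint (the article names the technique, not these tools): the pretrained encoder is frozen and only a small task-specific head is trained.

```python
# Minimal sketch of transfer learning: adapt a pretrained language model to a new
# task by freezing its body and training only a small classification head.
# Assumes PyTorch and Hugging Face `transformers`; illustrative, not Meta's or
# Epoch's actual pipeline.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Freeze the pretrained encoder so only the new task head receives updates.
for param in model.distilbert.parameters():
    param.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=5e-4
)

# One illustrative training step on a toy example.
batch = tokenizer(["an example sentence"], return_tensors="pt")
labels = torch.tensor([1])
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```

Because only the head's parameters are updated, the approach sidesteps part of the data hunger that full-scale pretraining creates.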

Meta begs to differ, it seems. Meta’s need to feed its AI beast won’t stop with Instagram and Facebook posts. “In the future, we anticipate using other content, such as interactions with AI features or chats with a business using AI at Meta AI,” says Fretta. As Epoch notes, another large reservoir of non-public human text data can be found in instant messaging applications like WhatsApp and Messenger, both owned by Meta.

WhatsApp, which has two billion users worldwide, is adding an AI assistant to connect businesses with customers, with a rollout beginning in India and Singapore, followed by Brazil and other markets. Meta says 200 million businesses already rely on WhatsApp to connect with customers. For a beast that needs feeding, that’s a lot of words on the plate.
