Elon Musk asked the question during the debut of his generative AI venture xAI: What happens when generative AIs run out of human-made material to train on and begin to use AI-created material instead? A slew of recent studies are now offering answers, and they point toward a phenomenon known as “model collapse” that could have dire consequences.
Up until now, most of the material on the internet scraped for use by AIs was created by humans. But with the sudden proliferation of AIs, some experts think 90 percent of the information on the internet will be created by AIs by 2030. This “synthetic data” isn’t readily identifiable, as AI-content detectors have proven unreliable. Referring to the large language models at the heart of generative AI, a team of researchers from a group of British universities, including Cambridge and Oxford, notes that “these models now (arguably) pass a weaker form of the Turing test in the sense that their output cannot be reliably distinguished from text written by humans.” AI will inevitably wind up training on material produced by its predecessor AIs.
This trend may be exacerbated by what might be a forthcoming “dark age of public information,” described by Ray Wang, CEO of Constellation Research, in a recent essay. Precipitating this dark age is the fear that private industry and government will limit publicly available data in order to protect their intellectual property. Publicly available information may be limited to ads, promotional marketing material and what individuals choose to reveal. Most information will remain in private networks or closed silos, extensions of the paywalls already in use.
So what happens when AIs train on material generated by other AIs? Researchers call the result “model collapse.” As noted by researchers at Stanford and Rice universities in a paper on what they call “Model Autophagy Disorder” (MAD), the quality of generative AI responses to user questions may deteriorate. “A self-consuming AI loop of training itself on content generated by other AI inevitably results in generative AI tools doomed to have their quality and diversity of images and text generated falter.” In one example, input text about the construction of medieval churches produced accurate output in the first generation of training but had drifted into a discussion of jackrabbits by generation nine, a sequence that added “catastrophic forgetting” into the mix. In other words, model collapse is akin to inbreeding AIs that generate mutant responses, a phenomenon also dubbed “Hapsburg AI.” The generative becomes degenerative. Right now, model collapse is only detectable in its early stages.
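The loss of diversity described above can be illustrated with a toy simulation. This is only a minimal sketch under stated assumptions, not the method used in any of the studies mentioned: the "model" is just a table of word frequencies, each generation is "trained" solely on a finite sample drawn from the previous generation, and the function name `simulate_collapse` and all of its parameters are hypothetical.

```python
import random
from collections import Counter


def simulate_collapse(vocab_size=50, sample_size=200, generations=10, seed=0):
    """Toy self-consuming loop: each generation learns only from the last one's output.

    Returns the number of distinct tokens seen in each generation's sample.
    All names and parameter values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    tokens = list(range(vocab_size))

    # Generation 0 stands in for human-made data: a long-tailed distribution
    # in which a few tokens are common and many are rare.
    weights = [1.0 / (rank + 1) for rank in range(vocab_size)]

    distinct_per_generation = []
    for _ in range(generations):
        # The current model "writes" a finite corpus.
        sample = rng.choices(tokens, weights=weights, k=sample_size)
        counts = Counter(sample)
        distinct_per_generation.append(len(counts))

        # The next model is fit to that corpus alone. A token that was never
        # sampled gets weight zero and can never reappear.
        weights = [counts.get(token, 0) for token in tokens]

    return distinct_per_generation


if __name__ == "__main__":
    for generation, distinct in enumerate(simulate_collapse(), start=1):
        print(f"generation {generation}: {distinct} distinct tokens")
```

In this sketch the count of distinct tokens can only stay level or fall from one generation to the next, because anything missing from one generation's output is absent from the next generation's training data. Real language models are far more complex, but the rare "tail" of the data is what tends to vanish first in this kind of loop.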
Another concern is that AIs trained by AIs will be more prone to “hallucinations,” wherein a plausible-sounding result is actually woefully inaccurate, such as when AI-written content on MSN recommended the Ottawa Food Bank as a good destination for hungry tourists. Some in the AI field think the hallucination problem is unsolvable, because the source of the inaccuracy is almost impossible to determine. Large language models are relatively fragile because small input changes can result in dramatic changes in outputs. A related danger is data poisoning, wherein tiny bits of malicious misinformation inserted during training yield outsized undesirable behaviors.
One other danger is that an AI may generate such a large volume of material that it is practically impossible to check. Compounding the difficulty is that some internet news sites, for example, are simply running human-produced articles through an AI and posting the results. Combine this with an AI’s ability to generate mountains of material and the task becomes daunting.
It’s not all gloom and doom, however. Training material for AIs may come from private deals like the recent licensing arrangement reached between the Associated Press and OpenAI. Alternatively, training material certifiably produced by humans may become pricey “premium” content. And at least one AI insider suggests that companies preserve bulk stores of data created before 2023. History may mark the year as a demarcation line, much the way AD and BC define the current calendar.