Will the predicted generative AI model collapse happen, or can we take proactive steps to avert such a disaster? Here’s the deal.
There is an ongoing and quite heated debate that generative AI and large language models (LLMs) will end up collapsing.
Let’s talk about it.
In today’s column, I am continuing my coverage of the latest trends and controversies in the field of AI, especially in the realm of generative AI. The focus of this discussion will be on the hair-raising claim that generative AI will suffer catastrophic model collapse, a contention that has garnered keen interest and lots of anxious hand-wringing. The contentious matter ought to be of notable concern since generative AI and LLMs are becoming ubiquitous, and the possibility of a collapse could gravely undermine our modern-day world.
For my close-in look at all kinds of longstanding qualms about AI, such as whether advances will reach Artificial General Intelligence (AGI) and catch humankind off-guard, see the link here. Other column postings of mine that you might find of interest include an examination of what seems to be occurring computationally inside generative AI that makes the AI appear so fluent (the secret sauce, as it were) at the link here, plus a slew of other wide-ranging timely topics at the link here.
Setting The Stage About What’s Going On
First, let’s talk in general about generative AI and large language models (LLMs), doing so to make sure we are on the same page when it comes to discussing the matter at hand.
I’m sure you’ve heard of generative AI, the darling of the tech field these days.
Perhaps you’ve used a generative AI app, such as the popular ones of ChatGPT, GPT-4o, Gemini, Bard, Claude, etc. The crux is that generative AI can take input from your text-entered prompts and produce or generate a response that seems quite fluent. This is a vast overturning of old-time natural language processing (NLP), which used to be stilted and awkward to use, and which has now shifted into a new era of NLP fluency of an at times startling or amazing caliber.
The customary means of achieving modern generative AI involves using a large language model or LLM as the key underpinning.
In brief, a computer-based model of human language is established that has a large-scale data structure and does massive-scale pattern-matching on a large volume of data used for initial data training. The data is typically found by extensively scanning the Internet for lots and lots of essays, blogs, poems, narratives, and the like. The mathematical and computational pattern-matching homes in on how humans write, and then henceforth generates responses to posed questions by leveraging those identified patterns. It is said to be mimicking the writing of humans.
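To give a rough flavor of what that pattern-matching entails, here is a minimal sketch of my own devising. It is a toy bigram model, nothing remotely like the neural architecture of a real LLM, but it captures the gist of identifying patterns in human-written text and then generating anew by leveraging those patterns:

```python
import random
from collections import defaultdict

# Toy "training": count which word follows which in human-written text.
corpus = "the cat sat on the mat and the dog sat on the rug".split()

follows = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    follows[current_word].append(next_word)

# Toy "generation": start with a word and repeatedly sample a plausible
# next word based on the observed patterns.
word = "the"
output = [word]
for _ in range(8):
    choices = follows.get(word)
    if not choices:  # dead end: no observed continuation for this word
        break
    word = random.choice(choices)
    output.append(word)

print(" ".join(output))
```

Real LLMs swap the word counts for neural networks trained across vast swaths of Internet text, but the spirit of the recipe, namely learning patterns from human writing and then generating from them, is the same.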
I think that is sufficient for the moment as a quickie backgrounder. Take a look at my extensive coverage of the technical underpinnings of generative AI and LLMs at the link here and the link here, just to name a few.
Back to the crux of things.
Why would anyone believe or assert that generative AI and LLMs are heading toward a massive catastrophic collapse?
The answer is relatively straightforward, namely, the mainstay underpinning has to do with data.
Data, data, data.
I just mentioned a moment ago that generative AI and LLMs are devised by scanning lots and lots of data. Without data, the spate of modern-day AI that seems so fluent would still be stuck in the backwater dark days of clunky old-fashioned natural language processing. You have to make available a ton of data that consists of essays, narratives, stories, poems, and the like, such that there is sufficient material to do a reasonably good job of mimicking human writing.
Some people liken data to oil. They proclaim that the need for and search for data is analogous to the pursuit of oil. We all realize that modern-day machinery such as cars, planes, and so on are dependent on the availability of oil. Oil is what makes the world turn, so they say. The idea is that data is what makes generative AI work. Oil is a precious commodity. So, it is said, is data.
No data, no generative AI.
In a research paper that sought to identify how long we have until we have exhausted the existing supply of data, “Will We Run Out Of Data? An Analysis Of The Limits Of Scaling Datasets In Machine Learning” by Pablo Villalobos, Jaime Sevilla, Lennart Heim, Tamay Besiroglu, Marius Hobbhahn, and Anson Ho, arXiv, October 26, 2022, made these key postulations (excerpts):
“Training data is one of the three main factors that determine the performance of Machine Learning (ML) models, together with algorithms and compute.”
“We analyze the growth of dataset sizes used in machine learning for natural language processing and computer vision and extrapolate these using two methods; using the historical growth rate and estimating the compute-optimal dataset size for future predicted compute budgets.”
“We investigate the growth in data usage by estimating the total stock of unlabeled data available on the internet over the coming decades.”
“Our analysis indicates that the stock of high-quality language data will be exhausted soon; likely before 2026.”
“By contrast, the stock of low-quality language data and image data will be exhausted only much later; between 2030 and 2050 (for low-quality language) and between 2030 and 2060 (for images).”
“Our work suggests that the current trend of ever-growing ML models that rely on enormous datasets might slow down if data efficiency is not drastically improved or new sources of data become available.”
The present gold rush of trying to find data is spurred by the aim of advancing generative AI.
A craving for data is sensible since the advances and money to be made from generative AI require plentiful data. You probably have read about the ongoing deals made between AI makers and firms that have copious amounts of data, such as major print publishers, online communities that have gobs of written postings, etc. Those who have the data want in on the excitement and bucks to be had from generative AI.
Furthermore, as I have covered extensively, those who have data are upset that their data was at times scanned and used without them getting a piece of the monetary action; thus, all sorts of copyright and Intellectual Property rights lawsuits are now underway, see my analyses at the link here and the link here, for example.
Here are some of the weighty questions at play. Is it ethical to scan data that is posted publicly on the Internet and use it to build generative AI? Is it legal to do so? Should those who post their data have a say in how the data is utilized? And what about accessing data that sits behind paywalls to scan that too? How can we distinguish between a mere look at data versus the “taking” of data by scanning or even copying it?
The whole kit and caboodle is a morass, that’s for sure.
Organic Data Versus Synthetic Data
Okay, I assume that you are with me about data being quite important in this heady matter. If we are running out of data, which it seems perhaps we are (not everyone agrees that this is the case; some contend that this is a sky-is-falling falsehood), the simplest form of logic suggests that we should just make more data.
Problem solved, kind of, maybe.
How could we make more data?
One approach entails hiring people to create stories, essays, narratives, and the like. Just pay people to write, write, and do some more writing. They will be your data-producing engine.
This turns out to be a lot more complicated and unsatisfying than you might imagine.
First, people are expensive. The cost to get humans to do writing is exorbitant. Second, people might write junk. They will aim to write as much as they can for the least number of bucks. The data might not be particularly usable. Third, people could be tempted to cheat and use generative AI to do their writing for them. Seriously, why write from scratch when you can enter a few prompts into generative AI and get the writing done on your behalf? You can be sitting at the beach and occasionally check in via your smartphone to ensure that generative AI is doing your tedious writing chore for you.
Nice.
Aha, if people aren’t the most expedient or effective source, the idea of turning toward generative AI as the savior in this instance seems like a smart idea. AI makers can cut out the middleman, as it were, and instead of relying on people to do the needed writing, they can entirely focus on getting generative AI to do it all. AI makers can readily use generative AI to produce data and feed that data into generative AI to garner further data training for the generative AI.
Sorry to say that this also has headaches. Allow me to back up and provide some added context on these data-related twists and turns.
The data that people compose by human hand is known as organic data, meaning that it was derived by humankind directly (i.e., we are organic beings). Imagine that someone sets up a website with blog postings that they have written. The text on those blogs is considered organic data because it was composed by a human.
In contrast, there is something known as synthetic data.
Synthetic data is generally considered any output that comes out of generative AI or LLMs. Imagine that you are using a generative AI app and ask about the life of Abraham Lincoln. The generative AI produces a nifty essay for you about Lincoln. We would say that the essay generated by the AI is construed as synthetic data. This then is text that was produced by the AI, rather than text that was produced by human hand per se. It is said to be synthetic data, rather than organic data.
Now that you know about those two major types or sources of data, I’d like to bring your attention to something quite worrisome about organic data that I alluded to earlier. Organic data is potentially going to be scanned in its near entirety such that there is no additional organic data left to be scanned.
Envision that we keep scanning the Internet for any morsel of human-written text. Not all of the human-written stuff is necessarily usable, so we need to realize that some of it isn’t going to do us any good. Of the human-written stuff that is usable, we will keep scanning and scanning to find every notable iota of text. Every inch of the Internet is to be scoured.
Voila, we eventually reach the endpoint, and all available scannable organic data has been found.
What has happened? Well, we have seemingly exhausted the supply of organic data. I’d like to note that this is somewhat of an extremist viewpoint since people are likely to always be making more organic data. It seems doubtful that humans will stop writing stuff that comes out of their heads.
Some charge that humans might indeed stop writing due to becoming reliant on generative AI. As a society, we will somehow decide that writing by human hand is no longer needed and will totally and exclusively rely upon generative AI to do our writing. I am skeptical about that particular proposition.
Let’s anyway go along with the premise that either all the organic data will have been scanned and no more is left to be scanned, or that we have done the scanning but that any new organic data is being added at a snail’s pace. In other words, even if humans are still writing and posting, the volume of growth of organic data is merely a trickle. Drips here or there, but nothing of any enormity.
If we exhaust all reasonably usable organic data, the further advancement of generative AI will seemingly hit a brick wall. Whatever amount of organic data we’ve scanned at that juncture is the furthest that generative AI will advance. The bottleneck now is organic data availability. Might as well accept that however generative AI functions at the time will regrettably be the best we will ever do, other than the trickle of new organic data that provides just the tiniest of added value.
Sad face.
Think of the opportunities lost due to the scarcity or exhaustion of organic data. We would have been able to push generative AI to much greater heights, discover cures for all sorts of killer diseases, solve world hunger, etc. Who knows how far we could go? Instead, the limits have been reached because we ran out of organic data. Darn the luck.
Pundits and researchers have made all kinds of predictions about when this exhaustion of organic data is going to happen, such as the example I noted in the research piece mentioned earlier. The usual guess is maybe in 5 years, or perhaps 10 years. All manner of intricate calculations are made about how much organic data there might exist today, how much we have scanned so far, how fast we will scan for more of it, and so on. Critics say that those are perhaps farfetched calculations or based on sketchy assumptions, and we are instead maybe 50 years away rather than just a handful of years away.
Take a reflective moment or two to consider the predicament facing us all.
I’ll wait.
Did you mull over the throes and woes of running out of organic data, and thus reaching a stopping point or bottleneck for advancing generative AI?
I’m sure you did.
Maybe you had an ingenious brainstorm and came up with a potential solution.
Here it is.
The potential solution is that we resort to using synthetic data. The logic seems impeccable. If the organic data that is human-written is our bottleneck, and there isn’t any left or it isn’t being produced fast enough anew, just switch over to synthetic data. Use synthetic data to further the advancement of generative AI.
The beauty of this solution is that we can pretty much make as much synthetic data as we want. Consider that we tell a generative AI app to start rattling off everything that can be said about the life of Abraham Lincoln. Tell the story of Honest Abe over and over again, doing so in dozens, hundreds, thousands of variations. The volume of data being produced could be astronomical.
We collect this generated data and start to feed it into generative AI as we seek to improve generative AI apps and LLMs. Need even more data? No problem, crank up generative AI to produce more. The sky is the limit.
Thank goodness, we aren’t stuck with the dilemma that we are going to run out of organic data. It won’t matter if we do. The ability to create nearly limitless volumes of synthetic data is sitting right there in front of our noses, waiting for us to turn on the faucet and get as much water (data) as we would ever wish to have.
Yay, problem solved.
Model Collapse Said To Be In Our Future
Let’s not count our chickens before they are hatched.
Some seriously doubt that synthetic data is going to be the rescue hero that it might seem to be. I’ll share with you the mainstay qualm that is most often mentioned and deliberated.
First, are you perchance familiar with a movie that starred Michael Keaton called Multiplicity?
The movie didn’t do that well at the box office so don’t be dismayed if you’ve not heard of it, let alone not seen it. In any case, I bring up the movie because a central premise (spoiler alert!) of the plot is that the main character in the movie makes a clone of himself. That turns out relatively okay, but he makes a clone of that clone. This turns out to be an issue. The second-generation clone is not as sharp as the first clone. A third clone is made. The result is dismal.
It is the old line about what happens when you make a copy of a copy. The copy loses something in the process and isn’t of the same high quality as the original. If you make a copy of the copy, the result gets worse. Each successive copy degrades further.
There is a parlor game known as the telephone game that illustrates this same principle. You tell one person something and ask them to tell it to another person. They try to do so but in the telling of things they inadvertently change aspects or fail to repeat precisely everything that was told to them. The person who now is supposed to further pass along the message does the same. By the time the last person in the sequence gets the story told to them, the resultant story being bandied along is a far cry from what the original story was.
Here’s what I am getting at.
Some ardently believe that using synthetic data is akin to that same issue of degradation of quality. They assert that if you try to use synthetic data to data train generative AI, in lieu of organic data, the results will likely be dismal. You are feeding data that is artificial into something artificial. It is said to be problematic and inevitably will lead to a downward spiral for generative AI.
The use of synthetic data is claimed to undercut generative AI and if we aren’t watching out for this, we will find that generative AI and large language models can no longer operate properly. Generative AI will have been bamboozled by low-quality synthetic data. The fluency that we think of for today’s generative AI will presumably disappear or greatly falter.
It could be a disaster in the making.
Done by us, due to lacking the foresight of what synthetic data might cause.
That is especially the case if we aren’t aware of what we are doing. In other words, we might lack astuteness and decide to feed in synthetic data as a means of propping up generative AI because we’ve run out of organic data. Our false belief that synthetic data is the solution will lead us down a dead-end path. Ultimately, we will, to our shock and dismay, see that generative AI has become useless and fruitless.
A Thoughtful Thought Experiment
Let’s do a bit of a thought experiment to see how this might play out.
Suppose that the widely and wildly popular ChatGPT was to be further data trained using synthetic data. First, we tell ChatGPT to start generating zillions of essays on gazillions of topics. A ton of synthetic data is subsequently generated. Next, we take that data and feed it back into ChatGPT, doing additional data re-training of the AI app.
With me so far?
I trust so.
If one round of this is seemingly good, we ought to repeat our endeavors. We take this updated version of ChatGPT and once again tell it to generate gazillions of essays on zillions of topics. Those are fed again into ChatGPT for additional training. But we aren’t done yet. We repetitively cycle this over and over again. Maybe zillions of times, if that’s what we want to do.
Some would say that this is a form of recursion. We are looping on the same thing repeatedly. Sometimes that’s a great way to accomplish things. Sometimes not.
What do you think the status of ChatGPT would be at this juncture after all those zillions upon zillions of retraining exercises based on synthetic data?
One guess is that ChatGPT would be better than ever. It would have had an opportunity to pattern-match on a volume of data that maybe would never have been attainable via organic data. We have possibly super-sized ChatGPT. We put it on steroids.
Others would argue that you are deluding yourself if that’s what you think would arise. They would contend that you are going to suffer the so-called curse of recursion. The curse is that upon using recursion in this fashion, the generative AI is going to be mashed into pure drivel. Similar to the tale of the game of telephone or the plot of the movie Multiplicity, your generative AI is going to sink to a new low and be unrecognizable compared to the stellar capabilities it once had.
Your model is going to collapse.
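To make the curse-of-recursion claim concrete, consider a toy simulation of my own devising (a cartoon in the spirit of the research discussed next, not anyone’s actual experiment). We repeatedly fit a simple statistical model to data and then train the next generation solely on samples drawn from the previous generation’s fit:

```python
import numpy as np

rng = np.random.default_rng(42)

# "Organic" data: samples from the true distribution (mean 0, std 1).
data = rng.normal(loc=0.0, scale=1.0, size=100)

for generation in range(1, 21):
    # "Train" a model: here, simply estimate the mean and std of the data.
    mu, sigma = data.mean(), data.std()
    print(f"generation {generation:2d}: mean={mu:+.3f} std={sigma:.3f}")
    # The next generation trains ONLY on synthetic samples from the
    # previous generation's model (replacement, not accumulation).
    data = rng.normal(loc=mu, scale=sigma, size=100)
```

Run this and the fitted standard deviation tends to drift downward from one generation to the next, a toy analog of the tails of the original distribution disappearing.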
A formative research paper on this provocative and contentious proposition hammered away at this dour consideration, doing so in a study entitled “The Curse Of Recursion: Training On Generated Data Makes Models Forget” by Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson, arXiv, April 14, 2024, which stated this crucial point (excerpt):
“It is now clear that large language models (LLMs) are here to stay and will bring about drastic change in the whole ecosystem of online text and images. In this paper, we consider what the future might hold. What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that the use of model-generated content in training causes irreversible defects in the resulting models, where the tails of the original content distribution disappear. We refer to this effect as model collapse.”
They refer to the idea that we will keep iterating generative AI apps, such as OpenAI’s GPT-1, GPT-2, GPT-3, GPT-4 (which is the top of the ChatGPT line right now), and GPT-5 (which is said to be under development), and keep doing so with each iteration numbered as the next nth in the series.
Per my earlier depiction, they assert that the future might consist of a model collapse upon using synthetic data for the continual training of a generative AI app. The synthetic data is characterized as being model-generated, meaning that the generative AI is the source that is creating the synthetic data.
Here are some additional key points that they proffered on this heated topic (excerpts):
“We discover that learning from data produced by other models causes model collapse – a degenerative process whereby, over time, models forget the true underlying data distribution, even in the absence of a shift in the distribution over time.” (ibid).
“We show that over time we start losing information about the true distribution, which first starts with tails disappearing, and over the generations learned behaviors start converging to a point estimate with very small variance.” (ibid).
“We separate two special cases: early model collapse and late model collapse. In early model collapse the model begins losing information about the tails of the distribution; in the late model collapse model entangles different modes of the original distributions and converges to a distribution that carries little resemblance to the original one, often with very small variance.” (ibid).
“Furthermore, we show that this process is inevitable, even for cases with almost ideal conditions for long-term learning i.e. no function estimation error.” (ibid).
“In other words, the use of LLMs at scale to publish content on the Internet will pollute the collection of data to train them: data about human interactions with LLMs will be increasingly valuable.” (ibid).
There is a lot in there to unpack.
Let’s take a shot at doing so.
Getting To The Bottom Of The Sea
The extensive use of synthetic data is said to cause model collapse, doing so in a degenerative process that I earlier sketched.
Each successive retraining is presumably going to worsen things, akin to each clone being less proficient than the prior clone. Whatever data was used initially to do the data training is now being corrupted, step by step. The whole concoction starts converging toward a kind of lowest common denominator. You are dragging what was good into a downward-spiraling abyss.
The researchers postulate that the model collapse might occur at an early stage, and/or might happen at a late stage. In the early stage, the generative AI gets worse and worse at the edges of what it can do. You might still be able to use the generative AI for a bunch of handy stuff, but when you go outside the core areas, the responses will be feeble. The late stage tends toward disruption across the board, essentially ripping shreds throughout the generative AI.
If that doesn’t make the hair rise on your head, this other mic-drop point might. It is said that this process of degeneration or degradation is going to be inevitable. The gist is that if things are allowed to go this route, the end result is nearly assured. Doom and gloom await the generative AI at the end of the rainbow.
The final point in the excerpts above surfaces a related tangent that will require me to take a moment to explain. Please tighten your seatbelt and mindfully put on a sturdy helmet.
Here’s how a scary additional thought arises.
Right now, we mainly have organic data posted on the Internet. Humankind has posted its hand-devised written content. Some of it is fresh, while some of it consists of data digitized from older works, such as the works of Shakespeare and others. All in all, we will say that the preponderance of data on the Internet is considered organic data, right now.
Generative AI is now among us. People are oftentimes using generative AI to create content. They take that content and post it on the Internet. You probably know that there is a lot of consternation about this. When you read an article, are you reading the words of the stated human author, or are you reading something they produced via generative AI?
You generally cannot discern which is which.
I know that some of you are scratching your heads saying that you thought there were surefire ways to detect text that has been produced by generative AI. This comes up quite a bit when discussing students in school and whether they are cheating by using generative AI to write their essays. Just to let you know, despite all those zany banner headlines, by and large, there is no viable way to determine whether data in the wild is written by a human versus produced by generative AI, see my explanation of why this is the case at the link here and the link here. Lamentably, some people make use of these detectors, not realizing they are often flawed and sometimes gimmicks, and falsely accuse others of using generative AI when in fact they handwrote the material in question.
Anyway, let’s keep on this train of thought that I’m laying out for you.
If generative AI is increasingly used to create content, and the content is posted to the Internet, we will gradually have more and more synthetic data in comparison to the amount of organic data posted out there in the wild. At some point, and since we can easily produce synthetic data at scale, being much easier to produce than laborious human-devised organic content, the Internet is going to overwhelmingly consist of synthetic data.
The amount of synthetic data will far exceed the amount of organic data.
For humans, this is somewhat dismaying because you will read stuff on the Internet and have no idea whether it was written by a human or written by AI. The odds will gradually shift toward the likelihood that anything you are reading is indeed devised by generative AI. That will always be the base assumption when we reach that somewhat dire juncture.
Yes, I realize that’s not the case right now, but it stands to reason that we are heading in that direction. It isn’t something that will necessarily happen slowly. If we ramp up tons of generative AI apps and get them to punch out gazillions of content items that get posted to the Internet, this switcheroo in the balance of content could happen faster than you might think. It might make your head spin.
This then is the situation we are facing. The Internet will be primarily populated with synthetic data that was created via generative AI. Here or there, there is still human-devised data, thus the organic data is still on the Internet, but it is a tiny portion. And, as stated earlier, there is no ready means to discern which data is which.
Let’s return to the issue of model collapse.
We devise a new generative AI app or LLM and opt to go ahead and scan the Internet to use as our training data. Ho-hum, that’s the way things have been done so far, and seems completely mundane and customary. The problem though is that this future of the Internet being dominated by synthetic data means that our generative AI is principally going to be pattern-matching on the use of synthetic data, not organic data.
Yikes, the generative AI is going to be in a sorry state of woe. If you believe that synthetic data is indeed inferior to organic data, we are basing the generative AI on what is now construed as polluted data. The pristine or preferred organic data (which might be foul-mouthed but is at least human-devised) becomes inseparably polluted with synthetic data.
The researchers address this predicament in their above-referenced paper (excerpts):
“Our evaluation suggests a ‘first mover advantage’ when it comes to training models such as LLMs.” (ibid).
“To make sure that learning is sustained over a long time period, one needs to make sure that access to the original data source is preserved and that additional data not generated by LLMs remains available over time.” (ibid).
“The need to distinguish data generated by LLMs from other data raises questions around the provenance of content that is crawled from the Internet: it is unclear how content generated by LLMs can be tracked at scale.” (ibid).
“One option is community-wide coordination to ensure that different parties involved in LLM creation and deployment share the information needed to resolve questions of provenance. Otherwise, it may become increasingly difficult to train newer versions of LLMs without access to data that was crawled from the Internet prior to the mass adoption of the technology, or direct access to data generated by humans at scale.” (ibid).
I shall briefly go over those points.
If you know that the world is heading toward an Internet with a preponderance of synthetic data, and you believe strongly in your heart that this will be a nightmare for training new generative AI apps, you would be wise to get going right now and build your generative AI app. You want to scan the Internet while all that organic data is not yet polluted.
Get in, while the getting is good, as they say.
In that sense, any AI maker that immediately proceeds at this time to devise their generative AI would be said to be gaining a first-mover advantage. They will have a version of a non-polluted generative AI in their hip pocket. Anyone else that comes down the pike later is going to discover to their shock and dismay that the Internet has turned into mush and the training of their generative AI at that time is going to be a dismal failure due to the synthetic data pollution at scale.
Those that were the first movers are now ahead of the crowd. The rest are left with the polluted data and will regret that they didn’t get into the game sooner.
But even the first movers could easily undermine their own efforts. They could shoot themselves in the foot. How? If they start retraining their generative AI to further advance it, they are risking bringing in that later polluted data as they proceed over time to do the data retraining. Ouch, that hurts.
Do you see the dilemma?
Opt to train now, using primarily organic data. Be ahead of latecomers. The downside is that you are going to have a generative AI based on data at that moment in time. It will be quickly outdated because it only has this non-polluted data from that moment in time. Okay, you say, just do some retraining later. No big deal. Aha, you are now in the buzzsaw of trying to do additional advancement, but the Internet increasingly has polluted data (due to synthetic data taking the helm). You are between a rock and a hard place.
Even the first movers are jammed up.
Maybe we could wave a magic wand and get every AI maker to agree to help with identifying, at the get-go, which data is organic and which is synthetic. Do this right now, while we are still in a primarily organic-data Internet. In a friendly communal manner, this would allow us to pick and choose over time whether we want to use organic data or synthetic data for our data training (it would also be a means of avoiding mutually assured destruction, in which they all otherwise fall flat once we attain an Internet of overwhelmingly polluted data).
Putting together this kind of kumbaya globally agreed data awareness and data sharing arrangement would seem a nice idea but unlikely in the dog-eat-dog world that we live in. Some suggest that we should pass laws on this, forcing the AI makers to work together. Others worry that if the AI makers do this, it will be a monopolistic practice. For my ongoing coverage of these vexing and intriguing AI-and-the-law considerations, see my column at the link here and the link here.
Return Of The Jedi
Take a deep breath.
Are we really going to witness the collapse of generative AI and LLMs?
The tale so far seems to paint a rather gloomy picture that a collapse is imminent and inevitable. The facts have a sense of being unshakable. Organic data will be used up. Synthetic data as an alternative can be readily generated, but it will cause us to endure the curse of recursion. In that sense, model collapse is presumably a fate we must accept. Might as well go home and call it a day.
Wait a second, perhaps the world isn’t as sour and depressing as it seems.
You see, this boiling pot is a complex stew whose core ingredients are speculation, conjecture, and assumptions. Let’s take a moment to pick apart the sad-face scenario and judge what else might be in store for the future. Going solely with the disturbing shadow of doom and gloom is myopically one-sided.
One handy place to begin the sleuth-like inspection is the question of whether synthetic data is as bad as seems to be proclaimed. Remember that the assertion appears to be that synthetic data is inferior to organic data, possibly radically inferior. If we can somehow overturn that claim, we have a fighting chance at avoiding the model collapse that has been speculated.
In a research study entitled “Best Practices And Lessons Learned On Synthetic Data For Language Models” by Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, and Andrew M. Dai, arXiv, April 11, 2024, the researchers make these notable points (excerpts):
“The success of AI models relies on the availability of large, diverse, and high-quality datasets, which can be challenging to obtain due to data scarcity, privacy concerns, and high costs.”
“Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.”
“One of the many benefits of synthetic data is that it can be generated at scale, providing an abundant supply of training and testing data for AI models.”
“Second, synthetic data can be tailored to specific requirements, such as ensuring a balanced representation of different classes by introducing controlled variations (e.g., up-weighting low-resource languages in multilingual language learning).”
“Third, synthetic data can help mitigate privacy concerns by creating anonymized or de-identified datasets that do not contain sensitive personal information.”
“Despite its promise, synthetic data also presents challenges that need to be addressed. One of them is ensuring the factuality and fidelity of synthetic data, as models trained on false, hallucinated or biased synthetic data may fail to generalize to real-world scenarios.”
The crux is that there are a lot of upsides to synthetic data, and that we can luckily be aware of, and do something about, the downsides of synthetic data. We are not powerless and subject to whatever synthetic data happens to be produced. After all, you only get blindsided by something that you didn’t know could happen.
Hey, the warning signs are here, alerts are buzzing, and we would have to bury our heads deep in the sand to not be cognizant of the coming potential doomsday.
An astute perspective asks these pressing questions:
Can we tune or shape generative AI to try and angle toward useful and high-quality production of data, avoiding the copy-of-a-copy degradations?
Can we leverage post-generative assessment tools to right away detect if generative AI is producing drivel or garbage, and seek to discard it, improve it, or at least stop the flow before it gets out of hand?
Can we have pre-input analysis tools that will scour data before it is pumped into generative AI for data training, seeking to curtail the entry of bad data or data that doesn’t meet suitable quality preferences (a rough sketch of this idea appears just after this list)?
Can we evaluate generative AI to speedily and soundly determine when any degradation seems to be emerging, and thus not allow a collapse per se to happen because of our otherwise being asleep at the wheel?
And so on.
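As a minimal sketch of that pre-input analysis idea (my own illustrative heuristics, not a production recipe; real pipelines would add perplexity scoring, trained quality classifiers, and deduplication at scale), a quality gate might look like this:

```python
def passes_quality_gate(text: str, seen_hashes: set) -> bool:
    """Decide whether a candidate document is worth feeding into data
    training. The thresholds here are assumptions, not tuned values."""
    words = text.split()
    # Heuristic 1: discard fragments too short to carry useful patterns.
    if len(words) < 50:
        return False
    # Heuristic 2: discard highly repetitive text, a common symptom of
    # degraded generative output.
    if len(set(words)) / len(words) < 0.3:
        return False
    # Heuristic 3: drop exact duplicates via a simple content hash.
    digest = hash(text.strip().lower())
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True

seen = set()
candidates = ["placeholder candidate document one", "placeholder two"]
training_ready = [t for t in candidates if passes_quality_gate(t, seen)]
print(f"kept {len(training_ready)} of {len(candidates)} candidates")
```

Crude as those checks are, even simple gates of this kind can keep the worst of the drivel from being pumped into data training.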
The answer, fortunately, is that those are all strong possibilities and are being researched and pursued at this very moment.
Some vocal doubters insist that no kind of data that comes out of generative AI will ever be as good as organic data devised by human hands. It is claimed that humans write with creativity, emotion, and human ingenuity, which supposedly generative AI will never match. We are sternly told that the nature of synthetic data will always be inferior.
Is that an ironclad rule of irrefutable strength?
Nope.
As I’ve repeatedly covered in my column, we already have lots of indications that generative AI can provide innovative compositions, along with writings that express emotional tidings, see my discussion at the link here and the link here. I am not suggesting that generative AI is sentient (it isn’t). I am saying that the mimicry produced by generative AI can be so stellar that it is on par with humankind’s quality, perhaps at times exceeding conventional levels of quality. Do not be misled into the classic trope that since AI is a machine, we won’t get anything other than dry machine-like responses. Keep in mind that this is mathematical and computational pattern-matching that in the large can generate writing based on how humans write.
The Intermixing Of Organic Data And Synthetic Data
Another linchpin that often goes along with the contention that model collapse is coming consists of assumptions about the takeover of data.
The logic is this. We take in synthetic data, and it displaces any organic data that we have in generative AI. The organic data gets pushed aside, eventually being completely tossed out. Synthetic data is all that we have left, and any trace of organic data has disappeared.
That’s quite an assumption.
Let’s ponder this.
In a research study entitled “Is Model Collapse Inevitable? Breaking the Curse Of Recursion By Accumulating Real And Synthetic Data” by Matthias Gerstgrasser, Rylan Schaeffer, Apratim Dey, Rafael Rafailov, Dhruv Pai, Henry Sleight, John Hughes, Tomasz Korbak, Rajashree Agrawal, Andrey Gromov, Daniel A. Roberts, Diyi Yang, David Donoho, and Sanmi Koyejo, arXiv, April 29, 2024, the researchers make these points (excerpts):
“The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs?”
“Recent investigations into model data feedback loops proposed that such loops would lead to a phenomenon termed model collapse, under which performance progressively degrades with each model-data feedback iteration until fitted models become useless.”
“However, those studies largely assumed that new data replace old data over time, where an arguably more realistic assumption is that data accumulate over time.”
“We confirm that replacing the original real data by each generation’s synthetic data does indeed tend towards model collapse, then demonstrate that accumulating the successive generations of synthetic data alongside the original real data avoids model collapse; these results hold across a range of model sizes, architectures, and hyperparameters.”
“Together, these results strongly suggest that the ‘curse of recursion’ may not be as dire as had been portrayed – provided we accumulate synthetic data alongside real data, rather than replacing real data by synthetic data only.”
You can hopefully discern that rather than necessarily obliterating organic data, we can have organic data remain alongside the synthetic data. They work hand in hand.
If you believe that organic data is better, it becomes a type of grounding to keep the synthetic data in check. Organic data can be used to decide which synthetic data is worth keeping versus discarding. Organic data can be used to amplify or enhance synthetic data. Etc.
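To see how much the accumulate-versus-replace assumption matters, here is a hedged extension of the earlier toy simulation (again my own cartoon, qualitatively echoing the paper’s finding rather than reproducing its experiments) that contrasts the two regimes:

```python
import numpy as np

def final_std(generations: int, accumulate: bool, seed: int = 7) -> float:
    """Toy retraining loop; returns the fitted std after the last round."""
    rng = np.random.default_rng(seed)
    pool = rng.normal(0.0, 1.0, size=100)  # the original "organic" data
    for _ in range(generations):
        mu, sigma = pool.mean(), pool.std()
        synthetic = rng.normal(mu, sigma, size=100)
        # The key contrast: keep prior data alongside the new synthetic
        # samples (accumulate) versus train on synthetic only (replace).
        pool = np.concatenate([pool, synthetic]) if accumulate else synthetic
    return pool.std()

print("replace only:", final_std(100, accumulate=False))  # tends to wither
print("accumulate:  ", final_std(100, accumulate=True))   # stays near 1.0
```

In the replacement regime the fitted spread tends to wither over the generations; in the accumulation regime the original organic data keeps anchoring each new generation near the true distribution.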
This brings up the other issue about the voluminous nature of producing synthetic data versus the paltry amount of organic data that is hard to find, hard to have created due to labor intensity, and the like. We don’t necessarily need to have any kind of one-for-one or tit-for-tat on organic data and synthetic data. It could be that a tiny portion of organic data can be used to slice and dice immense amounts of synthetic data or enrich vast fields of synthetic data.
Leverage every morsel of organic data. Put the organic data to its fullest possible use. Treasure organic data. Just because you can’t get it easily or cheaply devised doesn’t mean we have reached the end of the world. We can use the precious stuff to aid in ensuring that the synthetic data will work to our advantage.
Win-win when it comes to the brouhaha over organic data versus synthetic data.
Conclusion
A few final remarks for now on this weighty topic.
Charles Dickens famously wrote this opening line in A Tale Of Two Cities:
“It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way - in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.”
Mesmerizing, awe-inspiring, breathtaking. An amazing bit of writing and a humankind accomplishment of outstanding merit, no doubt about it.
When it comes to the future of generative AI and the question of a pending model collapse, you could say that it is the best of times and the worst of times, all at the same time. The worst of times, of course, is the looming specter of a model collapse. It won’t, though, happen overnight. Time is to our benefit here. And it won’t be a shocker since we are already cognizant of the possibility, no matter whether it is a remote chance or a solid chance.
A feared generative AI implosion or disintegration is a practical reality only if we fail to see what might happen and fail to take proactive corrective action. Will we be that visionless? Will we be that daft? The best of times is attributable because we know that a collapse is a theoretical possibility, and we know that we can take steps that will most likely stave off such a hypothesized collapse, provided, and this is the big if, that we proceed to do so. We must take sensible action.
Keep your hopes up and your eyes on the ball, and the dreaded disaster can be averted.
One other quick comment.
To illustrate the potential high-brow quality of synthetic data, I told ChatGPT to go ahead and use Charles Dickens’s famed opening line and compose something similar that could express the dilemma of generative AI model collapse.
Here’s what I got from ChatGPT:
“It was the dawn of potential, it was the dusk of doubt, it was the age of innovation, it was the age of caution, it was the epoch of possibilities, it was the epoch of uncertainties, it was the season of creation, it was the season of concern, it was the spring of breakthroughs, it was the winter of skepticism, we had promising futures ahead, we had fears of collapse looming, we were all envisioning progress, we were all fearing setbacks—in short, the period was so like the present period, that while some of its most vocal critics insisted on it being received with caution, the proactive and optimistic among us saw it as a moment ripe with opportunity, ready to be shaped for the better.”
Not bad at all for everyday synthetic data.
You be the judge.