AI - #7 (Buzzword of the moment: Model Collapse)


Model Collapse - A brief explanation

Has Generative AI already “collapsed”? Wow! That did not take long. I have noticed a bit of panic recently on the part of pundits and various proponents of Generative AI who started off claiming GAI was a panacea.

AI model collapse is a phenomenon where an AI model trained on its own generated content, or on a dataset that contains a significant amount of AI-generated content, begins to produce repetitive, redundant, or nonsensical outputs (hey, that sounds like most of my writing. Is AI guilty of copyright violation? :>). This can happen because the model becomes overly reliant on the patterns in its training data and fails to learn the actual underlying distribution of the data. A fancy way of saying the model is eating its own dog food, but the dog food is made of low-grade luggage leather (just like in real life). Think of this happening as versions of the model iterate and the model is too lazy to get new, clean, improved data: 1.0, 2.0, 3.0, etc.
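To make that versioned feedback loop concrete, here is a toy, numbers-only sketch (my own illustrative assumption, not anyone's production pipeline). Each "model version" is fit to samples from the previous version, and, because generative models tend to under-sample rare events, each round quietly loses the tails of the distribution:

    import numpy as np

    rng = np.random.default_rng(42)
    sigma = 1.0  # spread (std) of the original human-generated data

    for version in range(1, 8):
        samples = rng.normal(0.0, sigma, size=10_000)  # previous model's outputs
        # Toy assumption: the model under-samples rare events, so the next
        # training set is missing the tails (anything beyond 2 standard deviations).
        kept = samples[np.abs(samples) < 2 * sigma]
        sigma = kept.std()  # "retrain" model N+1 on model N's output
        print(f"model {version}.0: std = {sigma:.3f}")

Run it and the printed spread shrinks every generation; diversity drains out of the model even though no single step looks dramatic. That shrinking variance is the statistical signature of model collapse.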

Model collapse is a particularly concerning problem for generative AI models, such as large language models (LLMs) and generative adversarial networks (GANs). These models are trained to produce new and creative content, but they can also be used to generate synthetic data that is nearly indistinguishable from human-generated data. If a new AI model is trained on this synthetic data, it may collapse (start spitting out lousy answers or content) and begin to produce outputs that resemble the synthetic data it was trained on rather than reflecting the real world.

Model collapse can have a number of negative consequences. It can lead to the generation of misleading or harmful content, and it can make it difficult to trust the outputs of AI models. It can also make it more difficult to develop new AI models, as they may not be able to learn from the existing data if it is contaminated with synthetic content.

There are a number of ways to mitigate the risk of model collapse. One is to carefully curate the training data, and to avoid using synthetic data unless it is absolutely necessary. Another is to use techniques such as adversarial training and regularization to prevent the model from becoming overly reliant on any particular pattern in the training data.

Researchers are also working on developing new training methods that are more robust to model collapse. For example, some researchers have proposed training AI models on ensembles of datasets, which can help to reduce the impact of any individual dataset that may be contaminated with synthetic content.
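Here is a hedged sketch of that "ensemble of datasets" idea (the source names, weights, and risk labels below are my own illustrative assumptions): build each training batch from several independent sources so no single dataset, including one contaminated with synthetic content, can dominate what the model sees:

    import random

    random.seed(0)

    # Hypothetical corpora; in practice these would be real datasets.
    sources = {
        "curated_human_corpus": [f"human text {i}" for i in range(1000)],
        "web_scrape_2021": [f"older web text {i}" for i in range(1000)],
        "web_scrape_2024": [f"recent web text {i}" for i in range(1000)],  # higher synthetic-content risk
    }
    # Cap the riskier source at a small share of every batch.
    weights = {"curated_human_corpus": 0.5, "web_scrape_2021": 0.4, "web_scrape_2024": 0.1}

    def sample_batch(n=8):
        names = list(sources)
        picks = random.choices(names, weights=[weights[s] for s in names], k=n)
        return [random.choice(sources[s]) for s in picks]

    print(sample_batch())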

As AI models become more powerful and sophisticated, it is increasingly important to be aware of the potential for model collapse. By understanding the risks and taking steps to mitigate them, we can help to ensure that AI models are used safely and responsibly.

Table 1: Human-Generated Data vs. Synthetic Data [original table not reproduced]

One simple example of model collapse is in the context of a large language model (LLM) trained on a dataset of text and code. If the LLM is not carefully trained, it may learn to produce repetitive or nonsensical outputs, such as:

“This is a sentence. This is a sentence. This is a sentence.” Ruh Roh.

Maybe it’s a broken form of recursion? My wife says I tell the same stories over and over again, so maybe it’s like that? My stories keep getting better though.

This can happen because the LLM becomes overly reliant on the patterns in its training data and fails to learn the true underlying distribution of the data. A brief summary of the meaning of “underlying distribution”:

The “underlying distribution” of the data is related to model collapse in two ways:

  • The model can collapse if the training data does not accurately represent the underlying distribution of the real-world data. For example, if the training data is biased toward certain types of data, the model will be biased toward those types of data as well. This can lead to the model failing to generalize to real-world data that is not represented in the training data.

  • The model can collapse if the training data is generated by another AI model. The generated data is likely to reflect the biases and limitations of the model that generated it, so a model trained on that generated data will inherit those biases and limitations.

In this case, the LLM has learned that the pattern "This is a sentence." is a valid output, and it produces this output repeatedly. Humans do this too; just think of climate change or fair and free elections ... these two are patterns that are repeated over and over again, yet are obviously invalid. This too is a form of model collapse.
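Here is a small sketch of the first bullet above (a toy statistics example of my own, not an LLM): the "real world" has two modes, the training corpus only captured one, and the fitted model is hopelessly surprised by the mode it never saw:

    import numpy as np

    rng = np.random.default_rng(0)

    # True underlying distribution: two modes (think formal and informal writing).
    # Biased training set: only one mode made it into the corpus.
    train_data = rng.normal(-3.0, 1.0, size=5_000)
    mu, sd = train_data.mean(), train_data.std()  # fit a one-Gaussian "model"

    def avg_nll(x):
        # Average negative log-likelihood of x under the fitted Gaussian.
        return float(np.mean(0.5 * ((x - mu) / sd) ** 2 + np.log(sd * np.sqrt(2 * np.pi))))

    seen = rng.normal(-3.0, 1.0, size=1_000)   # data like the training set
    unseen = rng.normal(3.0, 1.0, size=1_000)  # the missing mode

    print("surprise on data it has seen: ", avg_nll(seen))    # around 1.4
    print("surprise on the missing mode:", avg_nll(unseen))   # around 19

The model does fine on data that looks like its training set and assigns almost no probability to the part of the real world it never saw; that is exactly the failure to generalize described above.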

Another example of model collapse is in the context of a generative adversarial network (GAN) trained to generate images of faces. If the GAN is not carefully trained, it may collapse and begin to produce images of the same face over and over again. Depending upon the repeated face, this can be a disturbing output.

This can happen because the discriminator becomes too strong, such that the generator fails to produce diverse samples that can fool the discriminator. In this case, the generator has learned that the only way to fool the discriminator is to produce the same face over and over again.
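For the curious, here is a minimal 1-D toy GAN sketch (assumes PyTorch is installed; the architecture and hyperparameters are illustrative assumptions, not a recipe). Real data has two modes, and a collapsed generator parks all of its samples on just one of them:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    def sample_real(n):
        # "Real" data: a 1-D mixture with modes at -2 and +2.
        modes = torch.randint(0, 2, (n, 1)).float() * 4.0 - 2.0
        return modes + 0.1 * torch.randn(n, 1)

    G = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
    D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
    bce = nn.BCELoss()

    for step in range(2000):
        real, noise = sample_real(64), torch.randn(64, 1)
        fake = G(noise)

        # Train D: label real samples 1, generated samples 0.
        opt_d.zero_grad()
        loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
        loss_d.backward()
        opt_d.step()

        # Train G: try to make D label its samples as real.
        opt_g.zero_grad()
        loss_g = bce(D(G(noise)), torch.ones(64, 1))
        loss_g.backward()
        opt_g.step()

    # Diagnostic: real data has a spread (std) of about 2. If the generated
    # spread is near zero, the generator is emitting a single mode, the GAN
    # analogue of producing the same face over and over again.
    with torch.no_grad():
        print("generated std:", G(torch.randn(1000, 1)).std().item())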

Model collapse can also happen in other types of AI models, such as classification models and regression models. It is important to be aware of the potential for model collapse, and to take steps to mitigate it.

Here are some tips to mitigate model collapse:

  • Carefully curate the training data. Avoid using synthetic data unless it is absolutely necessary.

  • Use techniques such as adversarial training and regularization to prevent the model from becoming overly reliant on any particular pattern in the training data.

  • Monitor the training process closely, and stop training early if you see signs of model collapse (a simple monitoring sketch follows this list).

  • Use ensembles of models to reduce the impact of any individual model that may collapse.
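As one concrete way to act on the monitoring tip above, periodically sample from the model and measure how repetitive the samples are. The distinct-bigram ratio below is one common diversity proxy I am assuming here, not an official metric from any framework:

    def distinct_bigram_ratio(texts):
        # Fraction of word bigrams that are unique across the samples.
        bigrams, total = set(), 0
        for t in texts:
            words = t.split()
            for pair in zip(words, words[1:]):
                bigrams.add(pair)
                total += 1
        return len(bigrams) / max(total, 1)

    healthy = ["the cat sat on the mat", "a dog ran in the park"]
    collapsed = ["this is a sentence this is a sentence"] * 2

    print(distinct_bigram_ratio(healthy))    # 1.0: every bigram is distinct
    print(distinct_bigram_ratio(collapsed))  # ~0.29: the model is repeating itself

If this ratio falls steadily across checkpoints, that is the "This is a sentence. This is a sentence." failure mode creeping in, and a signal to stop training early and inspect the data.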

Sources

  1. www.geeksforgeeks.org/modal-collapse-in-gans/

Jim Pyers

Multi-variant human. Value for Value.

https://www.linkedin.com/in/jamespyers/