Generative AI's growth is threatened by a looming data shortage, as web restrictions limit training data availability for large language models.
The advent of generative AI has supercharged the world’s appetite for data, especially high-quality data of known provenance. However, as large language models (LLMs) get bigger, experts are warning that we may be running out of data to train them.
One of the big shifts that came with transformer models, which were invented at Google in 2017, is the use of unsupervised learning. Instead of training an AI model in a supervised fashion on smaller amounts of higher-quality, human-curated data, unsupervised training with transformer models opened AI up to the vast amounts of variable-quality data available on the Web.
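To make that distinction concrete, here is a minimal sketch (plain Python, toy whitespace tokenizer, no real model) of why self-supervised next-token training can feed on raw web text: the training targets come from the text itself rather than from human-written labels. The snippet is illustrative only.

```python
# Minimal sketch of the data difference, not a real training pipeline.
# Supervised learning needs human-provided labels; self-supervised
# (next-token) training derives its targets directly from raw text,
# which is what lets transformer LLMs consume uncurated web data.

raw_text = "the cat sat on the mat"           # stand-in for scraped web text
tokens = raw_text.split()                     # toy "tokenizer": whitespace split

# Self-supervised: every (context, next token) pair is a free training example.
examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in examples:
    print(f"context={context!r} -> target={target!r}")

# Supervised learning, by contrast, would need a curated, human-labeled pair such as:
# ("the cat sat on the mat", "describes_an_animal")
```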
As pre-trained LLMs have gotten bigger and more capable over the years, they have required bigger and more elaborate training data sets. For instance, when OpenAI released its original GPT model in 2018, the model had about 117 million parameters and was trained on BookCorpus, a collection of about 7,000 unpublished books comprising roughly 4.5 GB of text.
GPT-2, which OpenAI launched in 2019, represented roughly a 10x scale-up of the original GPT. The parameter count grew to 1.5 billion, and the training data expanded via WebText, a novel training set the company built from outbound links shared by Reddit users. WebText comprised roughly 8 million documents and weighed in at around 40GB of text.
With GPT-3, which debuted in 2020, OpenAI expanded the parameter count to 175 billion. The model was pre-trained on about 570 GB of text culled from open sources, including two books corpora (Books1 and Books2), Common Crawl, Wikipedia, and WebText2. All told, the training set amounted to about 499 billion tokens.
While official size and training set details are scant for GPT-4, which OpenAI debuted in 2023, estimates peg the size of the LLM at somewhere between 1 trillion and 1.8 trillion parameters, which would make it roughly five to 10 times bigger than GPT-3. The training set, meanwhile, has been reported to be 13 trillion tokens (roughly 10 trillion words).
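A rough back-of-envelope conversion helps put those figures side by side. The sketch below assumes the common rule of thumb of about 0.75 English words per token, which is an approximation rather than an official figure from OpenAI.

```python
# Back-of-envelope check of the reported figures, assuming ~0.75 English
# words per token (a rough rule of thumb, not an official conversion).

WORDS_PER_TOKEN = 0.75

gpt3_tokens = 499e9   # ~499 billion tokens (GPT-3 training corpus)
gpt4_tokens = 13e12   # ~13 trillion tokens (reported for GPT-4)

print(f"GPT-3: ~{gpt3_tokens * WORDS_PER_TOKEN / 1e9:.0f} billion words")
print(f"GPT-4: ~{gpt4_tokens * WORDS_PER_TOKEN / 1e12:.1f} trillion words")
print(f"Training-data growth from GPT-3 to GPT-4: ~{gpt4_tokens / gpt3_tokens:.0f}x")
```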
As the AI models get bigger, the AI model makers have scoured the Web for new sources of data to train them. However, that is getting harder, as the creators and collectors of web data have increasingly imposed data restrictions on the use of data for training AI.
Artificial intelligence systems like ChatGPT could soon run out of what keeps making them smarter — the tens of trillions of words people have written and shared online. A new study released by research group Epoch AI projects that tech companies will exhaust the supply of publicly available training data for AI language models by roughly the turn of the decade — sometime between 2026 and 2032.
Comparing it to a “literal gold rush” that depletes finite natural resources, Tamay Besiroglu, an author of the study, said the AI field might face challenges in maintaining its current pace of progress once it drains the reserves of human-generated writing.
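The projection boils down to a simple extrapolation: training sets keep growing geometrically while the stock of public human-written text is roughly fixed. The toy calculation below illustrates the shape of that argument; the stock size, current usage, and growth rate are placeholder assumptions, not Epoch AI's actual inputs.

```python
# Illustrative only: a toy version of the kind of extrapolation behind the
# Epoch AI projection. All three inputs are placeholder assumptions.

import math

stock_tokens = 3e14    # assumed stock of public human-written text (tokens)
used_2024 = 1.3e13     # assumed tokens consumed by a frontier training run in 2024
annual_growth = 2.0    # assumed yearly growth factor in training-set size

# A single training run outgrows the stock once used_2024 * growth**n > stock_tokens.
years_until_exhaustion = math.log(stock_tokens / used_2024, annual_growth)
print(f"Stock exhausted after ~{years_until_exhaustion:.1f} years, "
      f"i.e. around {2024 + math.ceil(years_until_exhaustion)}")
```

With these made-up inputs the crossover lands around 2029, inside the 2026 to 2032 window the study describes; different assumptions shift the date but not the basic dynamic.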
In the short term, tech companies like ChatGPT-maker OpenAI and Google are racing to secure and sometimes pay for high-quality data sources to train their AI large language models – for instance, by signing deals to tap into the steady flow of sentences coming out of Reddit forums and news media outlets.
In the longer term, there won’t be enough new blogs, news articles, and social media commentary to sustain the current trajectory of AI development, putting pressure on companies to tap into sensitive data now considered private, such as emails or text messages, or to rely on less-reliable synthetic data spit out by the chatbots themselves.
“There is a serious bottleneck here,” Besiroglu said. “If you start hitting those constraints about how much data you have, then you can’t really scale up your models efficiently anymore. And scaling up models has been probably the most important way of expanding their capabilities and improving the quality of their output.”
As the AI models developed by tech companies become larger, faster, and more ambitious in their capabilities, they require more and more high-quality data to be trained on. Simultaneously, however, websites are beginning to crack down on the use of their text, images, and videos in training AI—a move that has restricted large swathes of content from datasets in what constitutes an “emerging crisis in data provenance,” according to a recent study published by the Data Provenance Initiative, a group led by researchers at the Massachusetts Institute of Technology (MIT).
The study found that in the past year alone, a “rapid crescendo of data restrictions from web sources,” set off by concerns regarding the ethical and legal challenges of AI’s use of public data, has restricted much of the web to both commercial and academic AI institutions. Between April 2023 and April 2024, 5 percent of all data and 25 percent of data from the highest-quality sources were restricted, the researchers found by examining some 14,000 web domains used to assemble three major datasets known as C4, RefinedWeb, and Dolma.
Major AI companies typically collect data through automated bots known as web crawlers, which explore the internet and record content. In the case of the C4 dataset, 45 percent of data has become restricted through website protocols preventing web crawlers from accessing content. These restrictions affect crawlers from different tech companies unevenly, typically to the advantage of “less widely known AI developers,” according to the study.
OpenAI’s crawlers, for example, were restricted for nearly 26 percent of high-quality data sources, while Google’s crawler was disallowed from around 10 percent and Meta’s from 4 percent.
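Those “website protocols” are, in most cases, the Robots Exclusion Protocol: a robots.txt file that names specific crawler user agents and disallows them. The sketch below uses Python’s standard urllib.robotparser against a made-up robots.txt to show how a site can block some AI crawlers (GPTBot for OpenAI, Google-Extended for Google’s AI training) while leaving others untouched; the file contents and the fourth user agent are purely illustrative.

```python
# Sketch of how robots.txt-based restrictions work. The robots.txt content
# below is a made-up example; GPTBot, Google-Extended, and CCBot are the
# commonly cited crawler tokens for OpenAI, Google AI training, and Common Crawl.

from urllib.robotparser import RobotFileParser

example_robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(example_robots_txt.splitlines())

# Check which crawlers this hypothetical site would allow to fetch a page.
for agent in ["GPTBot", "Google-Extended", "CCBot", "SomeResearchBot"]:
    allowed = parser.can_fetch(agent, "https://example.com/article.html")
    print(f"{agent:>16}: {'allowed' if allowed else 'blocked'}")
```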
If such constraints weren’t enough, the supply of public data to train AI models is expected to be exhausted soon. At the current pace of model development, developers could run out of data sometime between 2026 and 2032, according to the Epoch AI study released in June.
As Big Tech scrambles to find enough data to support their aggressive AI goals, some companies are striking deals with content-filled publications to gain access to their archives. OpenAI, for example, has reportedly offered publishers between $1 million and $5 million for such partnerships. The AI giant has already entered into deals with publications like The Atlantic, Vox Media, The Associated Press, the Financial Times, Time, and News Corp to use their archives for AI model training, often offering the use of products like ChatGPT in return.
To unlock new data, OpenAI has even considered using Whisper, its speech-recognition tool, to transcribe video and audio from websites like YouTube, a method that has also been discussed by Google. Other AI developers like Meta, meanwhile, have reportedly looked into acquiring publishers such as Simon & Schuster to obtain their large catalogs of books.
Another possible solution to the AI data crisis is synthetic data, a term used to describe data generated by AI models instead of humans. OpenAI’s Sam Altman brought up the method during an interview earlier this year where he noted that data from the Internet “will run out” eventually. “As long as you can get over the synthetic data event horizon, where the model is smart enough to make good synthetic data, I think it should be all right,” he said.
Some prominent AI researchers, however, believe fears over an emerging data crisis are overblown. Fei-Fei Li, a Stanford computer scientist often dubbed the “Godmother of AI,” argued that data limitation concerns are a “very narrow view” while speaking at the Bloomberg Technology Summit in May.
The future of AI development hinges on the availability of high-quality training data. As we approach the limits of publicly available data, the industry must innovate to find new sources or risk stalling progress. Whether through partnerships, synthetic data, or new data collection methods, the race is on to sustain the growth and capabilities of AI models.