From synthetic data to web scraping services, we break down the categories of tech companies helping to create, curate, and manage training data for AI models.
Large language models (LLMs) like OpenAI’s GPT series depend on massive quantities of training data.
For instance, OpenAI trained GPT-3 primarily on a Common Crawl dataset containing 45TB of compressed plain text, several times the amount of text held by the Library of Congress. While OpenAI has disclosed little about how it trained GPT-4, the model is estimated to use 10x as many parameters as GPT-3.
Even as these models grow to epic proportions, they are running out of freely available internet text to train on. As free text becomes less of a competitive differentiator in the coming years, owning proprietary content sources will become all the more valuable.
To continue advancing LLM performance, model developers are turning to vendors that provide synthetic training data, which can stand in where real-world data is scarce, as well as data curation tools that improve the quality and diversity of datasets while reducing redundancy and bias.
In the market map below, we identify 77 AI training data vendors that create, curate, and manage source data for AI models across 7 different categories.
