EleutherAI Releases Massive Open-Source AI Training Dataset
EleutherAI, an AI research organization, has released the Common Pile v0.1, an 8-terabyte dataset of openly licensed and public domain text. The dataset is intended as a transparent, legally sound resource for training AI models.
Developed over two years in collaboration with AI startups like Poolside and Hugging Face, along with academic institutions, the Common Pile v0.1 responds to growing copyright concerns in AI training. EleutherAI argues that lawsuits against AI companies over their use of copyrighted data have reduced transparency and slowed research progress.
In a blog post, Stella Biderman, EleutherAI's executive director, wrote that copyright lawsuits have not meaningfully changed data sourcing practices but have drastically decreased transparency from companies. That loss of transparency, she argues, makes it harder to understand how models work and to identify potential flaws, and it has hampered research in data-centric areas of AI.
The Common Pile: A Legally Sound Alternative
The Common Pile v0.1 draws on sources that include 300,000 public domain books from the Library of Congress and the Internet Archive. EleutherAI also used Whisper, OpenAI's open-source speech-to-text model, to transcribe audio content into text. The dataset was assembled in consultation with legal experts.
EleutherAI has already used the Common Pile v0.1 to train two new 7-billion-parameter AI models, Comma v0.1-1T and Comma v0.1-2T. These models reportedly perform comparably to models trained on unlicensed data, which EleutherAI presents as evidence of the dataset's effectiveness.
According to EleutherAI, benchmark results show the Comma models rivaling Meta's first Llama AI model in coding, image understanding, and math. This suggests that carefully curated, licensed data can produce competitive AI models.
Promoting Openness and Transparency in AI
EleutherAI believes the common assumption that unlicensed text is what drives model performance is unjustified. The organization expects the quality of models trained on openly licensed data to improve as more openly licensed and public domain data becomes available.
The Common Pile v0.1 is available for download from Hugging Face and GitHub. EleutherAI has committed to releasing open datasets more frequently, with the aim of fostering greater transparency and collaboration in AI research.
The release is a step toward addressing the ethical and legal challenges in AI development. It gives researchers an openly available resource for building robust and responsible AI models.