An 8TB Open Dataset Is Now Available for LLM Training – Built Entirely from Public Domain and Openly Licensed Sources
EleutherAI has released The Common Pile v0.1, an 8TB dataset made entirely from public domain and openly licensed text. The release is positioned as a step toward transparent, legally sound, and ethically grounded LLM training built on open data principles.


