Releasing the largest multilingual open pretraining dataset

From Simon Willison's Weblog
Common Corpus is a new "open and permissible licensed text dataset, comprising over 2 trillion tokens (2,003,039,184,047 tokens)" released by French AI lab PleIAs. This appears to be the largest available corpus of openly licensed training data: 926,541,096,243 tokens of public domain...