Releasing the largest multilingual open pretraining dataset

From Simon Willison's Weblog, 14 Nov 2024 | original ↗

Common Corpus is a new "open and permissible licensed text dataset, comprising over 2 trillion tokens (2,003,039,184,047 tokens)" released by French AI Lab PleIAs. This appears to be the largest available corpus of openly licensed training data: 926,541,096,243 tokens of public domain...


Growth of Publicly Available Genetic Sequencing Data

Jeff Kaufman's Writing | original ↗

MCTS and LLMs: what's the big deal?

seangoedecke.com RSS feed | original ↗

Analyzing GPT-4 Tokens

Koen van Gilst | original ↗

The LAVA Synthetic Bug Corpora

Push the Red Button | original ↗

Llama 3.2: New Edge AI and Vision Models

Tao of Mac | original ↗

Federal Register Data Exploration with R

nickb.dev | original ↗

Multimodality and Large Multimodal Models (LMMs)

Chip Huyen | original ↗

What I learned from looking at 900 most popular open source AI tools

Chip Huyen | original ↗

ML in Go with a Python sidecar

Eli Bendersky's website | original ↗

0009: 2021 Q1 roundup, updates to internal consistency, garden of forking paths, push vs pull, beca, cambria

Scattered Thoughts | original ↗

More from Simon Willison's Weblog

Quoting Greg Brockman

16 Jan 2025 | original ↗

Manual inspection of data has probably the highest value-to-prestige ratio of any activity in machine learning. — Greg Brockman, OpenAI, Feb 2023 Tags: machine-learning, openai, ai

Quoting gwern

16 Jan 2025 | original ↗

[...] much of the point of a model like o1 is not to deploy it, but to generate training data for the next model. Every problem that an o1 solves is now a training data point for an o3 (eg. any o1 session which finally stumbles into the right answer can be refined to drop the dead ends and produce a clean transcript to train a more refined...

Datasette Public Office Hours Application

16 Jan 2025 | original ↗

We are running another Datasette Public Office Hours event on Discord tomorrow (Friday 17th January 2025) at 2pm Pacific / 5pm Eastern / 10pm GMT / more timezones here. The theme this time around is lightning talks - we're looking for 5-8 minute long talks from community members about projects they are...

Evolving GitHub Issues (public preview)

16 Jan 2025 | original ↗

GitHub just shipped the largest set of changes to GitHub Issues I can remember in a few years. As an Issues power-user this is directly relevant to me. The big new features are sub-issues, issue types and boolean operators in search. Sub-issues look to be a more robust formalization of the existing feature...

100x Defect Tolerance: How Cerebras Solved the Yield Problem

16 Jan 2025 | original ↗

I learned a bunch about how chip manufacture works from this piece where Cerebras reveal some notes about how they manufacture chips that are 56x physically larger than NVIDIA's H100. The key idea here is core redundancy: designing a chip such that if there are defects the end-product...
