yet-another-applied-llm-benchmark

from blog Simon Willison's Weblog, 6 Nov 2024 | ↗ original

Nicholas Carlini introduced this personal LLM benchmark suite back in February as a collection of over 100 automated tests he runs against new LLM models to evaluate their performance against the kinds of tasks he uses them for. There are two defining features of this benchmark that make it interesting. Most...

Go talk to the LLM

meain/blog | original ↗

Effort Engine

Tao of Mac | original ↗

LLM-powered Biographies

Eugene Yan | original ↗

Building LLM applications for production

Chip Huyen | original ↗

Two Interesting Use Cases For LLMs

Brain Baking | original ↗

ai, meta, curiousity

Danny O'Brien's Oblomovka | original ↗

Llama 3.2: New Edge AI and Vision Models

Tao of Mac | original ↗

What Building Self-Hosted LLM Systems Taught Me About Software

Blogs on rohan ganapavarapu | original ↗

Bash One-Liners for LLMs

justine.lol | original ↗

Open-LLMs - A list of LLMs for Commercial Use

Eugene Yan | original ↗

More from Simon Willison's Weblog

DeepSeek API Docs: Rate Limit

18 Jan 2025 | original ↗

This is surprising: DeepSeek offer the only hosted LLM API I've seen that doesn't implement rate limits: DeepSeek API does NOT constrain user's rate limit. We will try out best to serve every request. However, please note that when our servers are under high traffic pressure, your requests may take some time to...

Lessons From Red Teaming 100 Generative AI Products

18 Jan 2025 | original ↗

New paper from Microsoft describing their top eight lessons learned red teaming (deliberately seeking security vulnerabilities in) 100 different generative AI models and products over the past few years. The Microsoft AI Red Team (AIRT) grew out of pre-existing red teaming initiatives at the...

Quoting Greg Brockman

16 Jan 2025 | original ↗

Manual inspection of data has probably the highest value-to-prestige ratio of any activity in machine learning. — Greg Brockman, OpenAI, Feb 2023

Quoting gwern

16 Jan 2025 | original ↗

[...] much of the point of a model like o1 is not to deploy it, but to generate training data for the next model. Every problem that an o1 solves is now a training data point for an o3 (eg. any o1 session which finally stumbles into the right answer can be refined to drop the dead ends and produce a clean transcript to train a more refined...

Datasette Public Office Hours Application

16 Jan 2025 | original ↗

We are running another Datasette Public Office Hours event on Discord tomorrow (Friday 17th January 2025) at 2pm Pacific / 5pm Eastern / 10pm GMT / more timezones here. The theme this time around is lightning talks - we're looking for 5-8 minute long talks from community members about projects they are...
