Layer-wise inferencing + batching: Small VRAM doesn't limit LLM throughput anymore

From the blog Languages and Architecture.
Also posted today: Higher RAII, and the Seven Arcane Uses of Linear Types, about how linear types let us control the future!

Currently, the general consensus is that you can't really run larger LLMs on ordinary computers.
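The title names the trick that challenges that consensus: instead of holding every layer's weights in VRAM at once, stream one layer at a time onto the GPU and push a large batch of sequences through it before swapping in the next layer, so each layer's transfer cost is amortized across the whole batch. Below is a minimal sketch of that idea, assuming PyTorch; the sizes, names, and the `layerwise_forward` helper are illustrative stand-ins, not the post's actual implementation.

```python
import torch
import torch.nn as nn

# Toy sizes so the sketch runs quickly; a real large model would have
# many more, much wider layers.
N_LAYERS, D_MODEL, N_HEADS = 8, 512, 8
BATCH, SEQ = 32, 64  # a big batch amortizes each layer's PCIe transfer

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical model: all layers live in ordinary CPU RAM (they could just
# as well be memory-mapped from disk), not in VRAM.
cpu_layers = [
    nn.TransformerEncoderLayer(D_MODEL, N_HEADS, batch_first=True).eval()
    for _ in range(N_LAYERS)
]

@torch.no_grad()
def layerwise_forward(x_cpu: torch.Tensor) -> torch.Tensor:
    """Run the batch through the model one layer at a time, so VRAM only
    ever holds one layer's weights plus the batch activations."""
    x = x_cpu.to(DEVICE)      # activations stay resident on the GPU
    for layer in cpu_layers:
        layer.to(DEVICE)      # stream this layer's weights into VRAM
        x = layer(x)          # run the whole batch through it once
        layer.to("cpu")       # evict it before loading the next layer
    return x.cpu()

hidden = torch.randn(BATCH, SEQ, D_MODEL)  # stand-in for embedded tokens
out = layerwise_forward(hidden)
print(out.shape)  # torch.Size([32, 64, 512])
```

Under this scheme peak VRAM scales with one layer plus the batch activations rather than the whole model. The cost is latency, since every forward pass pays to stream all the layers over the bus, which is why a large batch (trading latency for throughput) is the other half of the trick.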