Quick takes on the recent OpenAI public incident write-up

from blog Surfing Complexity, 15 Dec 2024 | ↗ original

OpenAI recently published a public writeup for an incident they had on December 11, and there are lots of good details in here! Here are some of my off-the-cuff observations:

Saturation

With thousands of nodes performing these operations simultaneously, the Kubernetes API servers became overwhelmed, taking down the Kubernetes control plane in...


OpenAI's postmortem for API, ChatGPT & Sora Facing Issues

Simon Willison's Weblog | original ↗

Fun with Nginx as an API cache

Brain Dump | original ↗

A skeptic's first contact with Kubernetes

Mumbling about computers | original ↗

What I learned from looking at 900 most popular open source AI tools

Chip Huyen | original ↗

In Support of SB 1047

Shtetl-Optimized | original ↗

Strangling your service with a Kubernetes misconfiguration

./techtipsy | original ↗

Is there room for Docker Compose in a Kubernetes world?

Mac's Tech Blog | original ↗

More Unorganised Thoughts about Bluesky

Robb Knight • Posts • RSS Feed | original ↗

0008: the last internal consistency, geoffrey litt's new newsletter, business structure vs quality, aws throttling, papoc, our machinery, on twitter, injuries

Scattered Thoughts | original ↗

A few arguments about Redis Sentinel properties and fail scenarios.

antirez | original ↗

More from Surfing Complexity

You’re missing your near misses

2 Feb 2025 | original ↗

FAA data shows 30 near-misses at Reagan Airport – NPR, Jan 30, 2025

The amount of attention an incident gets is proportional to the severity of the incident: the greater the impact to the organization, the more attention that post-incident activities will get. It’s a natural response, because the greater the impact, the more unsettling … Continue...

The danger of overreaction

12 Jan 2025 | original ↗

The California-based blogger Kevin Drum has a good post up today with the title Why don’t we do more prescribed burning? An explainer. There’s a lot of great detail in the post, but the bit that really jumped out at me was the history of the enormous forest fires that burned in Yellowstone National Park … Continue reading The danger...

Whither dashboard design?

22 Dec 2024 | original ↗

The sorry state of dashboards

It’s true: the dashboards we use today for doing operational diagnostic work are … let’s say suboptimal. Charity Majors is one of the founders of Honeycomb, one of the newer generation of observability tools. I’m not a Honeycomb user myself, so I can’t say much intelligently about the product. But … Continue reading...

The Canva outage: another tale of saturation and resilience

21 Dec 2024 | original ↗

Today’s public incident writeup comes courtesy of Brendan Humphreys, the CTO of Canva. Like so many other incidents that came before, this is another tale of saturation, where the failure mode involves overload. There’s a lot of great detail in Humphreys’s write-up, and I recommend you read it directly in addition to this post. What … Continue...

Your lying virtual eyes

7 Dec 2024 | original ↗

Well, who you gonna believe, me or your own eyes? – Chico Marx (dressed as Groucho), from Duck Soup

In the ACM Queue article Above the Line, Below the Line, the late safety researcher Richard Cook (of How Complex Systems Fail fame) notes that we software operators don’t interact directly with the system. Instead, … Continue reading Your lying...
