Quick takes on the recent OpenAI public incident write-up
More from Surfing Complexity
The sorry state of dashboards It’s true: the dashboards we use today for doing operational diagnostic work are … let’s say suboptimal. Charity Majors is one of the founders of Honeycomb, one of the newer generation of observability tools. I’m not a Honeycomb user myself, so I can’t say much intelligently about the product. But … Continue reading...
Today’s public incident writeup comes courtesy of Brendan Humphries, the CTO of Canva. Like so many other incidents that came before, this is another tale of saturation, where the failure mode involves overload. There’s a lot of great detail in Humphries’s write-up, and I recommend you read it directly in addition to this post. What … Continue...
Well, who you gonna believe, me or your own eyes? – Chico Marx (dressed as Groucho), from Duck Soup: In the ACM Queue article Above the Line, Below the Line, the late safety researcher Richard Cook (of How Complex Systems Fail fame) notes that we software operators don’t interact directly with the system. Instead, … Continue reading Your lying...
Think back on all of the availability-impacting incidents that have occurred in your organization over some decent-sized period, maybe a year or more. Is the majority of the overall availability impact due to: If you answered (2), then this suggests that the time-to-resolve (TTR) incident metric in your organization exhibits a power law...
Cloudflare consistently generates the highest quality public incident writeups of any tech company. Their latest is no exception: Cloudflare incident on November 14, 2024, resulting in lost logs. I wanted to make some quick observations about how we see some common incident patterns here. All of the quotes are from the original Cloudflare post....