The Canva outage: another tale of saturation and resilience
More from Surfing Complexity
The sorry state of dashboards It’s true: the dashboards we use today for doing operational diagnostic work are … let’s say suboptimal. Charity Majors is one of the founders of Honeycomb, one of the newer generation of observability tools. I’m not a Honeycomb user myself, so I can’t say much intelligently about the product. But … Continue reading...
OpenAI recently published a public writeup for an incident they had on December 11, and there are lots of good details in here! Here are some of my off-the-cuff observations: Saturation With thousands of nodes performing these operations simultaneously, the Kubernetes API servers became overwhelmed, taking down the Kubernetes control plane in...
Well, who you gonna believe, me or your own eyes? – Chico Marx (dressed as Groucho), from Duck Soup: In the ACM Queue article Above the Line, Below the Line, the late safety research Richard Cook (of How Complex Systems Fail fame) notes how that we software operators don’t interact directly with the system. Instead, … Continue reading Your lying...
Think back on all of the availability-impacting incidents that have occurred in your organization over some decent-sized period, maybe a year or more. Is the majority of the overall availability impact due to: If you answered (2), then this suggests that the time-to-resolve (TTR) incident metric in your organization exhibits a power law...
Cloudflare consistently generates the highest quality public incident writeups of any tech company. Their latest is no exception: Cloudflare incident on November 14, 2024, resulting in lost logs. I wanted to make some quick observations about how we see some common incident patterns here. All of the quotes are from the original Cloudflare post....