The Canva outage: another tale of saturation and resilience

from blog Surfing Complexity, 21 Dec 2024 | ↗ original

Today’s public incident writeup comes courtesy of Brendan Humphries, the CTO of Canva. Like so many other incidents that came before, this is another tale of saturation, where the failure mode involves overload. There’s a lot of great detail in Humphries’s write-up, and I recommend you read it directly in addition to this post. What … Continue...

Cloudflare Workers Are Kind Of Terrible

Welcome To A DevOps Blog on Valewood DevOps Consulting | original ↗

Let's blame the dev who pressed "Deploy"

yield code(); | original ↗

Uptime, status pages, and transparency calculus

Lawrence Jones | original ↗

Is It Time To Version Observability? (Signs Point To Yes)

charity.wtf | original ↗

Stressing the network when it's already down

benjojo blog | original ↗

The anatomy of a 2AM mental breakdown

Zarar's blog | original ↗

Mother of All Outages

Hazel Weakly | original ↗

A few arguments about Redis Sentinel properties and fail scenarios.

antirez | original ↗

It would be cool for 2024 to just calm down

Cassidy Williams | original ↗

The day of the blue screens of death

yield code(); | original ↗

More from Surfing Complexity

Whither dashboard design?

22 Dec 2024 | original ↗

The sorry state of dashboards. It’s true: the dashboards we use today for doing operational diagnostic work are … let’s say suboptimal. Charity Majors is one of the founders of Honeycomb, one of the newer generation of observability tools. I’m not a Honeycomb user myself, so I can’t say much intelligently about the product. But … Continue reading...

Quick takes on the recent OpenAI public incident write-up

15 Dec 2024 | original ↗

OpenAI recently published a public writeup for an incident they had on December 11, and there are lots of good details in here! Here are some of my off-the-cuff observations: Saturation With thousands of nodes performing these operations simultaneously, the Kubernetes API servers became overwhelmed, taking down the Kubernetes control plane in...

Your lying virtual eyes

7 Dec 2024 | original ↗

Well, who you gonna believe, me or your own eyes? – Chico Marx (dressed as Groucho), from Duck Soup. In the ACM Queue article Above the Line, Below the Line, the late safety researcher Richard Cook (of How Complex Systems Fail fame) notes that we software operators don’t interact directly with the system. Instead, … Continue reading Your lying...

MTTR: When sample means and power laws combine, trouble follows

2 Dec 2024 | original ↗

Think back on all of the availability-impacting incidents that have occurred in your organization over some decent-sized period, maybe a year or more. Is the majority of the overall availability impact due to: If you answered (2), then this suggests that the time-to-resolve (TTR) incident metric in your organization exhibits a power law...

Quick takes on the latest Cloudflare public incident write-up

28 Nov 2024 | original ↗

Cloudflare consistently generates the highest quality public incident writeups of any tech company. Their latest is no exception: Cloudflare incident on November 14, 2024, resulting in lost logs. I wanted to make some quick observations about how we see some common incident patterns here. All of the quotes are from the original Cloudflare post....
