Snapshot Isolation vs Serializability MathJax = { tex: {inlineMath: [['$', '$'], ['\\(', '\\)']]} }; Getting into some fundamentals. In my re:Invent talk on the internals of Aurora DSQL I mentioned that I think snapshot isolation is a sweet spot in the database isolation spectrum for most kinds of applications. Today, I want to dive in a...
DSQL Vignette: Wait! Isn’t That Impossible? MathJax = { tex: {inlineMath: [['$', '$'], ['\\(', '\\)']]} }; Laws of physics are real. In today’s post, I’m going to look at how Aurora DSQL is designed for availability, and how we work within the constraints of the laws of physics. If you’d like to learn more about the product first, check...
DSQL Vignette: Transactions and Durability MathJax = { tex: {inlineMath: [['$', '$'], ['\\(', '\\)']]} }; The hard half of a database system? In today’s post, I’m going to look at the other half of what’s under the covers of Aurora DSQL, our new scalable, active-active, SQL database. If you’d like to learn more about the product first,...
DSQL Vignette: Reads and Compute MathJax = { tex: {inlineMath: [['$', '$'], ['\\(', '\\)']]} }; The easy half of a database system? In today’s post, I’m going to look at half of what’s under the covers of Aurora DSQL, our new scalable, active-active, SQL database. If you’d like to learn more about the product first, check out the official...
DSQL Vignette: Aurora DSQL, and A Personal Story It's happening. In this morning’s re:Invent keynote, Matt Garman announced Aurora DSQL. We’re all excited, and some extremely excited, to have this preview release in customers’ hands. Over the next few days, I’m going to be writing a few posts about what DSQL is, how it works, and how to make the...
Ten Years of AWS Lambda Everything starts somewhere. Today, Werner Vogels shared his annotated version of the original AWS Lambda PRFAQ. This is a great inside look into how product development happens at AWS - the real working backwards process in action. This was, in some ways, the start of serverless computing2. Tim Wagner, Ajay Nair, and...
Garbage Collection and Metastability Cleaning up is hard to do. I’ve written a lot about stability and metastability, but haven’t touched on one other common cause of metastability in large-scale systems: garbage collection. GC is great. Garbage collected languages like Javascript, Java, Python, and Go power a big chunk of the internet’s...
Resource Management in Aurora Serverless Systems, big and small. My favorite thing about distributed systems is how they allow us to solve problems at multiple levels: single process problems, single machine problems, multi-machine problems, and large-scale cluster problems. Our new paper Resource management in Aurora Serverless1 describes what...
Let’s Consign CAP to the Cabinet of Curiosities CAP? Again? Still? Brewer’s CAP theorem, and Gilbert and Lynch’s formalization of it, is the first introduction to hard trade-offs for many distributed systems engineers. Going by the vast amounts of ink and bile spent on the topic, it is not unreasonable for new folks to conclude that it’s an...
Not Just Scale Bookmarking this so I can stop writing it over and over. It seems like everywhere I look on the internet these days, somebody’s making some form of the following argument: You don’t need distributed systems! Computers are so fast these days you can serve all your customers off a single machine! This argument is silly and...
It’s always TCP_NODELAY. Every damn time. It's not the 1980s anymore, thankfully. The first thing I check when debugging latency issues in distributed systems is whether TCP_NODELAY is enabled. And it’s not just me. Every distributed system builder I know has lost hours to latency issues quickly fixed by enabling this simple socket option,...
MemoryDB: Speed, Durability, and Composition. Blocks are fun. Earlier this week, my colleagues Yacine Taleb, Kevin McGehee, Nan Yan, Shawn Wang, Stefan Mueller, and Allen Samuels published Amazon MemoryDB: A fast and durable memory-first cloud database1. I’m excited about this paper, both because its a very cool system, and because it gives us an...
Formal Methods: Just Good Engineering Practice? Yes. The answer is yes. In your face, Betteridge. Earlier this week, I did the keynote at TLA+ conf 2024 (watch the video or check out the slides). My message in the keynote was something I have believed to be true for a long time: formal methods are an important part of good software engineering...
Finding Needles in a Haystack with Best-of-K Keep track of those needles. MathJax = { tex: {inlineMath: [['$', '$'], ['\\(', '\\)']]} }; As I’ve written about before, best of two and best of k are surprisingly powerful tools for load balancing in distributed systems. I have deployed them many times in large-scale production systems, and...
The Builder’s Guide to Better Mousetraps A little rubric for making a tough decision. Some people who ask me for advice at work get very long responses. Sometimes, those responses aren’t specific to my particular workplace, and so I share them here. In the past, I’ve written about writing, writing for an audience, heuristics, getting big things...
Better Benchmarks Through Graphs Isn't the ambiguity in the word *graphs* fun? MathJax = { tex: {inlineMath: [['$', '$'], ['\\(', '\\)']]} }; This is a blog post version of a talk I gave at the Northwest Database Society meeting last week. The slides are here, but I don’t believe the talk was recorded. I believe that one of the things...
How Do You Spend Your Time? Career advice, or something like it. MathJax = { tex: {inlineMath: [['$', '$'], ['\\(', '\\)']]} }; Some people who ask me for advice at work get very long responses. Sometimes, those responses aren’t specific to my particular workplace, and so I share them here. In the past, I’ve written about writing, writing...
Pat’s Big Deal, and Transaction Coordination Working together towards a common goal. MathJax = { tex: {inlineMath: [['$', '$'], ['\\(', '\\)']]} }; I have a lot of opinions about Pat Helland’s CIDR’24 paper Scalable OLTP in the Cloud: What’s the BIG DEAL?1. Most importantly, I like the BIG DEAL that he proposes: Scalable apps don’t...
What is Scalability Anyway? Do words mean things? Why? What does scalable mean? As systems designers, builders, and researchers, we use that word a lot. We kind of all use it to mean that same thing, but not super consistently. Some include scaling both up and down, some just up, and some just down. Some include both scaling on a box (vertical)...
Why Aren’t We SIEVE-ing? Captain, we are being scanned! Long-time readers of this blog will know that I have mixed feelings about caches. One on hand, caching is critical to the performance of systems at every layer, from CPUs to storage to whole distributed architectures. On the other hand, caching being this critical means that designers need...
It’s About Time! What's the time? Time to get a watch. MathJax = { tex: {inlineMath: [['$', '$'], ['\\(', '\\)']]} }; My friend Al Vermeulen used to say time is for the amusement of humans1. Al’s sentiment is still the common one among distributed systems builders: real wall-clock physical time is great for human-consumption (like log...
Optimism vs Pessimism in Distributed Systems What—Me Worry? Avoiding coordination is the one fundamental thing that allows us to build distributed systems that out-scale the performance of a single machine1. When we build systems that avoid coordinating, we end up building components that make assumptions about what other components are doing....
Writing For Somebody Who's there? Sometimes I write long emails to people at work. Sometimes those emails are generally interesting, and not work-specific at all. Sometimes I share those emails here on my blog. This may be one of those times. Always write for somebody. Always have an idea in your head, as you’re writing, who your writing is...
Exponential Value at Linear Cost What a deal! MathJax = { tex: {inlineMath: [['$', '$'], ['\\(', '\\)']]} }; Binary search is kind a of a magical thing. With each additional search step, the size of the haystack we can search doubles. In other words, the value of a search is exponential in the amount of effort. That’s a great deal. There...
On The Acoustics of Cocktail Parties Only parties of well-mannered guests will be considered. MathJax = { tex: {inlineMath: [['$', '$'], ['\\(', '\\)']]} }; If you, like me, tend to practice punctual arrival at parties, you’ve likely noticed that most parties start out quiet. Folks are talking in small groups, using their normal voices,...
Invariants: A Better Debugger? 🎵Some things never change🎵 Like many of my blog posts, this started out as a long email to a colleague. I expanded it here because I thought folks might find it interesting. I don’t tend to use debuggers. I’m not against them. I’ve seen folks do amazing things with gdb, and envy their skills. I just don’t tend to...
My Favorite Bits of OSDI/ATC’23 Talking to 3D people is cool again. This week brought USENIX ATC’23 and OSDI’23 together in Boston. While I’ve followed OSDI and ATC papers for years, it’s the first time I’ve been to either of them (I’ve have been to NSDI a couple times). It was a really good time. In this post I’ll cover a couple of my favorite...
Bélády’s Anomaly Doesn’t Happen Often Anomaly is a really fun word. Try saying it ten times. It was 1969. The Summer of Love wasn’t raging4, Hendrix was playing the anthem, and Forest Gump was running rampant. In New York, IBM researchers Bélády, Nelson, and Schedler were hot on the trail of something strange. They had a paging machine, a...
What is a container? What are words, anyway? A common cause of confusion and miscommunication I see is different people using different definitions of words. Sometimes the definitions are subtly different (as with availability). Sometimes they’re completely different, and we’re just talking about different things entirely. A common example is the...
Container Loading in AWS Lambda Slap shot? MathJax = { tex: {inlineMath: [['$', '$'], ['\\(', '\\)']]} }; Back in 2019, we started thinking about how allow Lambda customers to use container images to deploy their Lambda functions. In theory this is easy enough: a container image is an image of a filesystem, just like the zip files we...
Open and Closed, Omission and Collapse Were you born in a cave? MathJax = { tex: {inlineMath: [['$', '$'], ['\\(', '\\)']]} }; This, from Open Versus Closed: A Cautionary Tale by Schroeder et al1 is one of the most important concepts in systems performance: Workload generators may be classified as based on a closed system model, where...
The Four Hobbies, and Apparent Expertise MathJax = { tex: {inlineMath: [['$', '$'], ['\\(', '\\)']]} }; Around the end of high school, I started to get really into photography. My friend (let’s call him T) was also into it, which should have been great fun. But it wasn’t. Going shooting with him was never great, for a reason I didn’t...
Surprising Scalability of Multitenancy MathJax = { tex: {inlineMath: [['$', '$'], ['\\(', '\\)']]} }; When most folks talk about the economics of cloud systems, their focus is on automatically scaling for long-term seasonality: changes on the order of days (fewer people buy things at night), weeks (fewer people visit the resort on...
False Sharing versus Perfect Placement MathJax = { tex: {inlineMath: [['$', '$'], ['\\(', '\\)']]} }; This is part 3 of an informal series on database scalability. The previous parts were on NoSQL, and Hot Keys. In the last installment, we looked at hot keys and how they affect the theoretical peak scale a database can achieve. Hidden in...
Hot Keys, Scalability, and the Zipf Distribution the: so hot right now. MathJax = { tex: {inlineMath: [['$', '$'], ['\\(', '\\)']]} }; Does your distributed database (or microservices architecture, or queue, or whatever) scale? It’s a good question, and often a relevant one, but almost impossible to answer. To make it a meaningful...
NoSQL: The Baby and the Bathwater Is this a database? This is a bit of an introduction to a long series of posts I’ve been writing about what, fundamentally, it is that makes databases scale. The whole series is going to take me a long time, but hopefully there’s something here folks will enjoy. On March 12 2006, Australia set South Africa the...
Erasure Coding versus Tail Latency There are zero FEC puns in this post, against my better judgement. MathJax = { tex: {inlineMath: [['$', '$'], ['\\(', '\\)']]} }; Jeff Dean and Luiz Barroso’s paper The Tail At Scale popularized an idea they called hedging, simply sending the same request to multiple places and using the first one to...
Under My Thumb: Insight Behind the Rules My left thumb is exactly 25.4mm wide. Starting off in a new field, you hear a lot of rules of thumb. Rules for estimating things, thinking about things, and (ideally) simplifying tough decisions. When I started in Radar, I heard: the transmitter makes up three quarters of the cost of a radar system and...
Lambda Snapstart, and snapshots as a tool for system builders Clones. Yesterday, AWS announced Lambda Snapstart, which uses VM snapshots to reduce cold start times for Lambda functions that need to do a lot of work on start (starting up a language runtime3, loading classes, running static code, initializing caches, etc). Here’s a short 1 minute...
Amazon’s Distributed Computing Manifesto Manifesto made manifest. In the Johannesburg of 1998, I was rocking a middle parting, my friend group was abuzz about the news that there was water (and therefore monsters) on Europa, and all the cool kids were getting satellite TV at home1. Over in Seattle, the folks at Amazon.com had started to notice...
Writing Is Magic Magic can be dangerous. Sometimes when folks ask me for advice at work, I write them very long emails to answer their question. Sometimes, those emails are generally interesting and not work-specific, so I share them here. A couple days ago somebody asked me about how to get better at communicating their ideas and opinions, how...
Give Your Tail a Nudge Tricks are fun. MathJax = { tex: {inlineMath: [['$', '$'], ['\\(', '\\)']]} }; We all care about tail latency (also called high percentile latency, also called those times when your system is weirdly slow). Simple changes that can bring it down are valuable, especially if they don’t come with difficult tradeoffs....
Atomic Commitment: The Unscalability Protocol 2PC is my enemy. MathJax = { tex: {inlineMath: [['$', '$'], ['\\(', '\\)']]} }; Let’s consider a single database system, running on one box, good for 500 requests per second. ┌───────────────────┐ │ Database │ │(good for 500 rps) │ └───────────────────┘ What if we want to access that...
Histogram vs eCDF Accumulation is a fun word. Histograms are a rightfully popular way to present data like latency, throughput, object size, and so on. Histograms avoid some of the difficulties of picking a summary statistic, or group of statistics, which is hard to do right. I think, though, that there’s nearly always a better choice than...
What is Backoff For? Back off man, I'm a scientist. Years ago I wrote a blog post about exponential backoff and jitter, which has turned out to be enduringly popular. I like to believe that it’s influenced at least a couple of systems to add jitter, and become more stable. However, I do feel a little guilty about pushing the popularity of jitter...
Getting into formal specification, and getting my team into it too Getting started is the hard part Sometimes I write long email replies to people at work asking me questions. Sometimes those emails seem like they could be useful to more than just the recipient. This is one of those emails: a reply to a software engineer asking me how they could...
The DynamoDB paper The other database called Dynamo This week at USENIX ATC’22, a group of my colleagues1 from the AWS DynamoDB team are going to be presenting their paper Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service. This paper is a rare look at a real-world distributed system that runs at massive...
Formal Methods Only Solve Half My Problems At most half my problems. I have a lot of problems. The following is a one-page summary I wrote as a submission to HPTS’22. Hopefully it’s of broader interest. Formal methods, like TLA+ and P, have proven to be extremely valuable to the builders of large scale distributed systems1, and to researchers...
What is a simple system? Is this pretentious? Why do I need cryptography when I could simply hide the contents of my communications rotating every letter by 13? Why do I need a distributed storage system when I could simply store my files on this one server? Why do I need a database when I could simply use a flat file? Do any of those things, and...
Simple Simulations for System Builders Even the most basic numerical methods can lead to surprising insights. It’s no secret that I’m a big fan of formal methods. I use P and TLA+ often. I like these tools because they provide clear ways to communicate about even the trickiest protocols, and allow us to use computers to reason about the systems...
Fixing retries with token buckets and circuit breakers Throttle yourself before you DoS yourself. After my last post on circuit breakers, a couple of people reached out to recommend using circuit breakers only to break retries, and still send normal first try traffic no matter the failure rate. That’s a nice approach. It provides possible...
Will circuit breakers solve my problems? Maybe, but you need to know what problem you're trying to solve first. A couple of weeks ago, I started a tiny storm on Twitter by posting this image, and claiming that retries (mostly) make things worse in real-world distributed systems. The bottom line is that retries are often triggered by overload...
Software Deployment, Speed, and Safety There's one right answer that applies in all situations, as always. Disclaimer: Sometime around a 2015, I wrote AWS’s official internal guidance on balancing deployment speed and safety. This blog post is not that. It’s not official guidance from AWS (nothing on this blog is), and certainly not guidance for...
DynamoDB’s Best Feature: Predictability Happy birthday! It’s 10 years since the launch of DynamoDB, Amazon’s fast, scalable, NoSQL database. Back when DynamoDB launched, I was leading the team rethinking the control plane of EBS. At the time, we had a large number of manually-administered MySQL replication trees, which were giving us a lot of...
The Bug in Paxos Made Simple There's not really a bug in Paxos, but clickbait is fun. Over the last few weeks, I’ve been picking up the excellent P programming language, a language for modelling and specifying distributed systems. One of the first things I did in P was implement Paxos - an algorithm I know well, has a lot of subtle failure modes,...
Serial, Parallel, and Quorum Latencies Why are they letting me write Javascript? I’ve written before about the latency effects of series (do X, then Y), parallel (do X and Y, wait for them both), and quorum (do X, Y and Z, return when two of them are done) systems. The effects of these different approaches to doing multiple things are quite...
Caches, Modes, and Unstable Systems Best practices are seldom the best. Is your system having scaling trouble? A bit too slow? Sending too much traffic to the database? Add a caching layer! After all, caches are a best practice and a standard way to build systems. What trouble could following a best practice cause? Lots of trouble, as it turns...
My Proposal for Arecibo: Drones With apologies to real radio astronomers Last night I finally got around to watching Grady Hillhouse’s excellent video on the collapse of the Arecibo Telescope. At the end of Grady’s video he says: I hope eventually that they can replace the telescope with an instrument as futuristic and forward-looking as the...
Latency Sneaks Up On You And is a bad way to measure efficiency. As systems get big, people very reasonably start investing more in increasing efficiency and decreasing costs. That’s a good thing, for the business, for the environment, and often for the customer. Much of the time efficient systems have lower and more predictable latencies, and...
Metastability and Distributed Systems What if computer science had different parents? There’s no more time-honored way to get things working again, from toasters to global-scale distributed systems, than turning them off and on again. The reasons that works so well are varied, but one reason is especially important for the developers and...
Tail Latency Might Matter More Than You Think A frustratingly qualitative approach. Tail latency, also known as high-percentile latency, refers to high latencies that clients see fairly infrequently. Things like: “my service mostly responds in around 10ms, but sometimes takes around 100ms”. There are many causes of tail latency in the world,...
Redundant against what? Threat modeling thinking to distributed systems. There’s basically one fundamental reason that distributed systems can achieve better availability than single-box systems: redundancy. The software, state, and other things needed to run a system are present in multiple places. When one of those places fails, the others can...
What You Can Learn From Old Hard Drive Adverts The single most important trend in systems. Adverts for old computer hardware, especially hard drives, are a fun staple of computer forums and the nerdier side of the internet1. For example, a couple days ago, Glenn Lockwood tweeted out this old ad: At least this isn’t an ad for a HAMR drive. $10k in...
Incident Response Isn’t Enough Single points of failure become invisible. Postmortems, COEs, incident reports. Whatever your organization calls them, when done right they are a popular and effective way of formalizing the process of digging into system failures, and driving change. The success of this approach has lead some to believe that...
The Fundamental Mechanism of Scaling It's not Paxos, unfortunately. A common misconception among people picking up distributed systems is that replication and consensus protocols—Paxos, Raft, and friends—are the tools used to build the largest and most scalable systems. It’s obviously true that these protocols are important building blocks....
Quorum Availability It's counterintuitive, but is it right? In our paper Millions of Tiny Databases, we say this about the availability of quorum systems of various sizes: As illustrated in Figure 4, smaller cells offer lower availability in the face of small numbers of uncorrelated node failures, but better availability when the proportion of...
Getting Big Things Done In one particular context. A while back, a colleague wanted to make a major change in the design of a system, the sort of change that was going to take a year or more, and many tens of person-years of effort. They asked me how to justify the project. This post is part of the email reply I sent. The advice is in context of...
Consensus is Harder Than It Looks And it looks pretty hard. In his classic paper How to Build a Highly Available System Using Consensus Butler Lampson laid out a pattern that’s become very popular in the design of large-scale highly-available systems. Consensus is used to deal with unusual situations like host failures (Lampson says reserved for...
Focus on the Good Parts Skepticism and cynicism can get in your way. Back in May, I wrote Reading Research: A Guide for Software Engineers, answering common questions I get about why and how to read research papers. In that post, I wrote about three modes of reading: solution finding, discovery, and curiosity. In subsequent conversations, I’ve...
Surprising Economics of Load-Balanced Systems The M/M/c model may not behave like you expect. I have a system with c servers, each of which can only handle a single concurrent request, and has no internal queuing. The servers sit behind a load balancer, which contains an infinite queue. An unlimited number of clients offer c * 0.8 requests per...
A Story About a Fish Nothing's more boring than a fishing story. In the 1930s, Marjorie Latimer was working as a museum curator in East London. Not the eastern part of London as one may expect. This East London is a small city on South Africa’s south coast, named so thanks to colonialism’s great tradition of creative and culturally relevant place...
Code Only Says What it Does Only loosely related to what it should do. Code says what it does. That’s important for the computer, because code is the way that we ask the computer to do something. It’s OK for humans, as long as we never have to modify or debug the code. As soon as we do, we have a problem. Fundamentally, debugging is an exercise...
Some Virtualization Papers Worth Reading A short, and incomplete, survey. A while back, Cindy Sridharan asked on Twitter for pointers to papers on the past, present and future of virtualization. A picked a few of my favorites, and given the popularity of that thread I decided to collect some of them here. This isn’t a literature survey by any...
Reading Research: A Guide for Software Engineers Don't be afraid. One thing I’m known for at work is reading research papers, and referring to results in technical conversations. People ask me if, and how, they should read papers themselves. This post is a long-form answer to that question. The intended audience is working software engineers. Why...
Two Years With Rust I like it. I hope it's going to be big. It’s been just over two years since I started learning Rust. Since then, I’ve used it heavily at my day job, including work in the Firecracker code base, and a number of other projects. Rust is a great fit for the systems-level work I’ve been doing over the last few years: often...
Firecracker: Lightweight Virtualization for Serverless Applications Our second paper for NSDI'20. In 2018, we announced Firecracker, an open source VMM optimized for multi-tenant serverless and container workloads. We heard some interest from the research community, and in response wrote up our reasoning behind building Firecracker, and how its...
Physalia: Millions of Tiny Databases Avoiding Hard CAP Tradeoffs A few years ago, when I was still working on EBS, we started building a system called Physalia. Physalia is a custom transactional key-value store, designed to play the role of configuration master in the EBS architecture. Last year, we wrote a paper about Physalia, and were...
Why do we need distributed systems? Building distributed systems is hard. It's expensive. It's complex. But we do it anyway. I grew up reading John Carmack’s .plan file. His stories about the development of Doom, Quake and the rest were a formative experience for me, and a big reason I was interested in computers beyond just gaming1. I was a...
Kindness, Wickedness and Safety We must build kind systems. David Epstein’s book Range: Why Generalists Triumph in a Specialized World turned me on to the idea of Kind and Wicked learning environments, and I’ve found the idea to be very useful in framing all kinds of problems.1 The idea comes from The Two Settings of Kind and Wicked Learning...
When Redundancy Actually Helps Redundancy can harm more than it helps. Just after I joined the EBS team at AWS in 2011, the service suffered a major disruption lasting more than two days to full recovery. Recently, on Twitter, Andrew Certain said: We were super dependent on having a highly available network to make the replication work, so...
Is Anatoly Dyatlov to blame? Without a good safety culture, operators are bound to fail. (Spoiler warning: containers spoilers for the HBO series Chernobyl, and for history). Recently, I enjoyed watching HBO’s new series Chernobyl. Like everybody else on the internet, I have some thoughts about it. I’m not a nuclear physicist or engineer, but I...
Some risks of coordinating only sometimes Sometimes-coordinating systems have dangerous emergent behaviors A classic cloud architecture is built of small clusters of nodes (typically one to nine1), with coordination used inside each cluster to provide availability, durability and integrity in the face of node failures. Coordination between...
Learning to build distributed systems A long email reply A common question I get at work is “how do I learn to build big distributed systems?”. I’ve written replies to that many times. Here’s my latest attempt. Learning how to design and build big distributed systems is hard. I don’t mean that the theory is harder than any other field in computer...
Control Planes vs Data Planes Are there multiple things here? If you want to build a successful distributed system, one of the most important things to get right is the block diagram: what are the components, what does each of them own, and how do they communicate to other components. It’s such a basic design step that many of us don’t think...
Telling Stories About Little’s Law Building Up Intuition with Narrative Little’s Law is widely used as a tool for understanding the behavior of distributed systems. The law says that the mean concurrency in the system (𝐿) is equal to the mean rate at which requests arrive (λ) multiplied by the mean time that each request spends in the system...
Availability and availability Translating math into engineering. It’s well known that the term Availability in the CAP theorem (as formally defined by Gilbert and Lynch) means something different from the term availability that’s commonly used by the designers, builders and operators of distributed systems. Gilbert and Lynch define availability...
Balls Into Bins In Distributed Systems Throwing things can be fun. If you’ve come across the Balls Into Bins problem, you probably heard about in context of hash tables. When you hash things into a hash table (especially with separate chaining) it’s really useful to be able to ask “If I throw 𝑀 balls into 𝑁 bins, what is the distribution of...
Is the Mean Really Useless? Don't be too mean to the mean. “The mean is useless” is a commonly-repeated statement in the systems observation and monitoring world. As people correctly point out, the mean (or average1) tends to hide information about outliers, tends to be optimistic for many metrics, and can even be wildly misleading in presence of...
Why Must Systems Be Operated? Latent Failures and the Safety Margin of Systems Mirrored RAID1 is a classic way of increasing storage durability. It’s also a classic example of a system that’s robust against independent failures, but fragile against dependent failure. Patterson et al’s 1988 paper, which popularized mirroring, even covered the...
Heuristic Traps for Systems Operators What can we learn from avalanche safety? Powder magazine’s new feature The Human Factor 2.0 is a fantastic read. It’s a good disaster story, like the New York Times’ Snow Fall: The Avalanche At Tunnel Creek, but looks deeply at a very interesting topic: the way that we make risk decisions. I think there are...
Is there a CAP theorem for Durability? Expanding the taxonomy of distributed systems. The CAP theorem considers only two of the axes of tradeoffs in distributed systems design. There are many others, including operability, security, latency, integrity, efficiency, and durability. I was recently talking over a beer or two with a colleague about...
CALISDO: Threat Modeling for Distributed Designs Some steps towards a mnemonic threat model for distributed systems. Threat modeling from the security field, and business impact analysis from the continuity management field, are powerful and influential ways of structured thinking about particular kinds of problems. The power of threat modeling...
Sodium Carbonate, and Ramenized Pasta Doing chemistry with baking soda, cabbage and an oven. Ramenizing things is super popular right now. The idea seems to have originated from the Ideas In Food crew, and is obviously adapted from the practice of making Ramen in Japan and similar alkaline noodles in China. In those recipes, the secret is...
The Zero, One, Infinity Disease Numbers are important. “The only reasonable numbers are zero, one and infinity.” (Bruce MacLennan) Rules and heuristics are important. Within our own heads, they are mental shortcuts we use to save ourselves from needing to reason everything out from first principles. Between us, they are devices that we can use...
How Amazon Web Services Uses Formal Methods Now in CACM. How Amazon Web Services Uses Formal Methods is in this month’s Communications of the ACM. This version isn’t changed much from the versions that have been online for a few months, but it’s great to see it get some more attention. In the same issue of CACM is Leslie Lamport’s Who Builds a...
Jitter: Making Things Better With Randomness Jitter is a good thing. Two weeks ago, I wrote an article titled Exponential Backoff and Jitter for the AWS Architecture blog. It looks at OCC in particular, but the lessons are applicable to all distributed systems. The bottom line is that exponential backoff is good, but not sufficient to prevent...
Electoral Trouble in Sybilania An Small Town Struggles To Achieve a Fair Vote. Sybilania1 is a small town on the banks of the Orange river, near where where the river turns North toward the Augrabies falls. The main street runs sleepily from the exclusive retirement communities on the river to the grounds of the Northern Cape Rugby Club,...
Does Bitcoin Solve Byzantine Consensus? An Interesting New Publication on Bitcoin and Consensus. The Bitcoin community is a fascinating mixture of political idealists, technology enthusiasts, entrepreneurs, investors and others. One group that’s increasingly prominent is distributed systems researchers, attracted to some of the interesting...
A Quiet Defense of Patterns Twenty years late to the party. I find myself coming back to Patterns of Software every few years. I think about it often, mostly when I am doing code reviews. One great part is the front matter: a short debate between the author and Christopher Alexander, first author of the much-celebrated A Pattern Language. The...
Make Your Program Slower With Threads How much do context switches matter? Years ago, while taking a numerical methods course, I wrote some code to calculate the expected number of shared birthdays in a group. The code is very simple: each attempt constructs a vector of N birthdays, then counts the duplicates. The outer loop runs millions of...
Two Farmers and Common Knowledge A legislative solution to a technical problem. When their beloved father passed away, Jan and Marie inherited his famous wine estate. Jan and his family were given the historic homestead and half the grape harvesting equipment. Marie’s family were given the other half, and a graceful home with stunning views down...
Exactly-Once Delivery May Not Be What You Want It's hard to get, but that's OK, because you don't want it. Last week, there was a good discussion on lobste.rs about why exactly-once messaging is not possible. The discussion was kicked off with a link to a paper from Patel et al titled Towards In-Order and Exactly-Once Delivery using Hierarchical...
Ice Cream and Distributed Systems Can we serve a fair amount of ice cream? When I was a child, I really liked to eat ice cream. It’s still pretty great, but back then I was somewhat fanatical about it. My parents knew that the delicious mix of fat and sugar was best served only occasionally, and would carefully limit my intake. I, of course,...
Harvest and Yield: Not A Natural Cure for Tradeoff Confusion Comments on a 15 year old paper. As I wrote about in my post on PACELC, I don’t think the CAP theorem is the right way for teachers to present distributed systems tradeoffs. I also don’t think it’s ideal for working practitioners, despite its wide use. I prefer Abadi’s PACELC, but there...
The Essential Barbara Liskov Some of my favorite Barbara Liskov publications. Barbara Liskov is one of the greats of computer science. Over a research career nearing 45 years, she’s had a resounding impact on multiple different fields, and received an impressive list of honors and awards, including the 2009 Turing Award. In the same spirit as The...
The Space Between Theory and Practice in Distributed Systems How do we learn synthesis? Teaching and learning about distributed systems, like any complex topic, requires real thought about what to teach and what to learn. It would be great to have enough time to teach and learn everything, but there’s just too much material out there. Even if we...
Use of Formal Methods at Amazon Web Services How we're using TLA+ at AWS Late last year, we published Use of Formal Methods at Amazon Web Services about our experiences with using formal methods at Amazon Web Services (AWS). The focus is on TLA+, and why we think it’s a great fit for the kind of work we do. From the paper: In order to find...
CAP and PACELC: Thinking More Clearly About Consistency CAP is confusing. PACELC is better, but still not ideal. In some sense, the CAP theorem has been too successful. With its snappy name and apparently easy-to-understand behavior, CAP has become the go-to way of talking about tradeoffs in distributed systems. Despite its apparent simplicity,...
Two traps in iostat: %util and svctm These commonly-used fields in iostat shouldn't be commonly-used. iostat, from the excellent sysstat suite of utilities, is the go-to tool for evaluating IO performance on Linux. It’s obvious why that’s the case: sysstat is very useful, solid, and widely installed. System administrators can go a lot worse than...
The Operations Gradient: Improving Safety in Complex Systems Can we improve the safety of complex systems by listening to operators more? This week, I watched an excellent lecture by Richard Cook. He goes in some detail about why failures happen, through the lens of Rasmussen’s model of system safety. If you build or maintain any kind of complex...
Viewstamped Replication: The Less-Famous Consensus Protocol The first practical consensus protocol may be the least famous. There’s no doubt that Paxos is the most famous distributed consensus protocol. Distributed systems reading lists (such as this one by Dan Creswell and this one by Christopher Meiklejohn) nearly all include at least one paper...
The Essential Nancy Lynch Some of my favorite Nancy Lynch publications. While reading distributed systems papers, one of the names that comes up most often is Nancy Lynch’s. From a standard textbook for university distributed systems courses (Distributed Algorithms), to some of the earliest successful results on consensus, to the proof of the CAP...
Failure Detectors, and Non-Blocking Atomic Commit Non-blocking atomic commit is harder than uniform consensus. Why would that be? Many of the most interesting results in distributed systems have come from looking at problems known to be impossible under one set of constraints, and finding how little those constraints can be relaxed before the...
The Essential Leslie Lamport Some of my favourite Leslie Lamport publications. After it was announced that Leslie Lamport had won the 2013 A.M. Turing award, the link to his list of publications found popularity on most of the tech-related sites I visit. It’s an excellent page, with a long (and growing) list of Lamport’s publications, and witty...
Snark, Chord, and Trust in Algorithms Good journals, well-known authors and informal proofs are not sufficient. Mars Code, in February’s CACM, is a very interesting look from the outside at some of the software engineering practices that helped make the Mars Science Laboratory mission successful. The authors cover a lot of ground in the article,...
Distributed Consensus: Beating Impossibility with Probability One Distributed systems models are critical to understanding impossibility results Reading Nancy Lynch’s 1989 paper A Hundred Impossibility Proofs for Distributed Computing was the first time I came to a real understanding of the value of impossibility proofs. Before reading it, I was...
Restricted Transactional Memory on Haswell Exploring the performance of Intel's RTM In my last post, I looked at the performance of HLE on Intel’s Haswell processor, and found that while it offered a nice speedup in some cases, it can cost performance in others. Still, Intel’s TSX is an extremely exciting technology. In this post, I look at the...
Hardware Lock Elision on Haswell Exploring the performance of Intel's HLE A couple of months ago, I bought myself a new home PC, upgrading from my old Core2 Q6600 to a shiny new Haswell-based Xeon E3-1240v3. Honestly, I don’t use my home PC that much, so the biggest draw for upgrading was trying out some of the features in Haswell, and getting...
Beyond iostat: Storage performance analysis with blktrace An under appreciated set of IO analysis tools. If you’ve spent much time at all investigating IO performance on Linux, you’re no doubt already familiar with iostat from the venerable sysstat package. iostat is the go-to tool for Linux storage performance monitoring with good reason: it’s...
Some Patterns of Engineering Design Meetings On discussing designs. I spend a lot of time in engineering design discussions and meetings, talking about how we are going to introduce new features, solve problems, and increase capacity. For me, a good design meeting is the highlight of any day - sharing ideas, comparing options and considering...
Exploring TLA+ with two-phase commit Using testable pseudocode to test a distributed algorithm There are very few distributed algorithms more widely known by working programmers than the two-phase commit atomic commit protocol. It’s a great algorithm to use for teaching purposes: two-phase commit is both extremely simple to write down, and has...
C++11’s atomic and volatile, under the hood on x86 How do C++11's atomic and volatile work? In my previous post Java’s Atomic and volatile, under the hood on x86 I look at Atomic and volatile in Java, and how they affect the generated assembly. In this post, I’m looking at std::atomic and volatile in C++. Like in Java, it’s well known that...
Java’s Atomic and volatile, under the hood on x86 How exactly do AtomicInteger and volatile do their magic in Java? It’s well known that volatile in Java doesn’t mean the same thing as atomic. As Jeremy Manson says: If you do an increment of a volatile integer, you are actually performing three separate operations: 1) Read the integer to a...
Highly contended and fair locking in Java How do explicit locks compare to volatile access? In my last post on Java’s volatile, I showed how (in one set of experiments) Java volatile variable reads don’t come for free. The cost of accessing a highly-contended volatile variable in one micro-benchmark came out at about 100x the cost of accessing a...
Are volatile reads really free? Some claim that reads from volatile variables are free in Java on x86. Is that claim true? There is a somewhat common belief among Java programmers that reads from volatile variables are free. Are volatile variable ‘reads’ as fast as normal reads? from StackOverflow is a perfect example, because it’s the top result...
Expect Less, Get More? On what newly hired engineers think they need to do. Hiring good software engineers is really difficult. It’s hard to find good people, hard to filter good hires from bad hires, and possibly even harder to decide what ‘good’ really means. When we find a good hire, we want to make sure they can reach their potential as...
Latency lags bandwidth On the growing gap between capacity, bandwidth and latency. One of the key engineering challenges for most high-performance storage systems is minimizing the number of disk seeks that are required to access stored data. Sometimes of this requires smart techniques like reordering, but the majority of the win comes from...
The properties of crash-only software My thoughts about a classic paper Crash-only software by Candea and Fox is a very interesting paper which is well worth your time if you spend any time designing software or services. Re-reading it today, I noticed how useful the section headers of section 3 Properties of Crash-Only Software appear outside...
The power of two random choices Using less information to make better decisions. In many large-scale web services, multiple layers of stateless and stateful services are seperated by load balancers. Load balancing can be done with dedicated hardware, with dedicated software load balancers, using DNS trickery or through a load-balancing mechanism...
The benefits of having data Two ways to look at drive failures and temperature. Almost all recent articles and papers I have read on hard drive failure rates refer to either Failure Trends in a Large Disk Drive Population from Google, or Estimating Drive Reliability in Desktop Computers and Consumer Electronics Systems from Seagate. Despite both...