incidents · postmortems · lessons

Real incidents.
Real systems.
Actionable lessons.

A library of structured engineering case studies, drawn from public postmortems. Read what broke, understand why, and steal the lesson.

weekly newsletter

One useful case study. Every week.

// production judgment

Study the systems engineers actually depend on.

Whiteboard system design teaches ideal architecture. Real incidents teach what happens when deploys, dependencies, traffic, configuration, and rollback pressure collide.

// read first

Start with failures worth remembering.

Production failures rarely come from nowhere. These incidents trace the decisions that seemed reasonable at the time, the cascades that followed, and what engineers did differently when it was over.

FM-004 · Facebook

The Day Facebook Deleted Its Own Route to the Internet

A backbone command issued to assess global capacity unintentionally took down every connection in Facebook's backbone. The audit tool that was supposed to block such a command had a bug, and Facebook's DNS servers, unable to reach the data centers, withdrew the BGP routes that announced them to the world.

bgp · dns · ~6h

FM-015 · Microsoft Azure

The Impossible Date That Broke Azure VM Startup

A leap-day bug stopped new Azure VMs from joining the control plane globally, then a rushed recovery update disconnected VMs in seven clusters.

leap-year · vm-startup · 34h 15m
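
The summary above does not spell out the mechanism, so the following is only a minimal sketch of how a leap-day bug of this kind can arise. It assumes the common "add one to the year" pattern for computing an expiry date. The function names `naive_one_year_later` and `safe_one_year_later` are hypothetical and are not taken from any Azure code.

```python
from datetime import date


def naive_one_year_later(d: date) -> date:
    """Hypothetical buggy helper: bump the year, keep month and day.

    On Feb 29 this asks for Feb 29 of a non-leap year, which does not
    exist, so date.replace raises ValueError.
    """
    return d.replace(year=d.year + 1)


def safe_one_year_later(d: date) -> date:
    """Hypothetical fix: fall back to Feb 28 when the target date is invalid."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        # Only Feb 29 can fail here; clamp to the last valid day of February.
        return d.replace(year=d.year + 1, day=28)


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    print(safe_one_year_later(leap_day))
    try:
        naive_one_year_later(leap_day)
    except ValueError as exc:
        print("naive version failed:", exc)
```

Clamping to Feb 28 is one possible policy; rolling forward to Mar 1 is another. Either way, the leap-day input needs an explicit test case, because the naive version behaves correctly on every other day of the year.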

FM-006 · GitLab

The `rm -rf` That Erased GitLab's Production Database

A sysadmin accidentally deleted GitLab.com's production PostgreSQL database. The normal backups were broken or unsuitable, so GitLab restored from a six-hour-old LVM snapshot.

database · postgresql · 18h 30m

// recent

Recent case studies.

The latest real-world failures, broken down into readable engineering lessons. Understand the system, the weak point, and the pattern before it shows up in your own stack.

FM-018 · AWS

The Overheated AWS Zone

A thermal event in one US-EAST-1 data center impaired EC2 instances and EBS volumes in use1-az4, disrupting workloads that depended on resources pinned to the affected Availability Zone.

2 months ago · us-east-1 · use1-az4

FM-019 · Slack

The Encryption Path Under Slack Messages

Slack EKM customers experienced message sending, channel loading, workflow, notification, DM, and file-operation issues after elevated encryption-key request load turned a security dependency into an availability bottleneck.

2 months ago · enterprise-key-management · kms

FM-017 · Cloudflare

The DNSSEC Failure That Made .de Look Fake

Incorrect DNSSEC signatures for Germany's .de top-level domain caused validating resolvers to reject .de answers, leading Cloudflare to temporarily bypass DNSSEC validation for the zone.

2 months ago · denic · de-dnssec

FM-016 · GitHub

The Search Layer That Slowed GitHub

A concentrated wave of anonymous scraping traffic saturated the load-balancing tier in front of GitHub Search, causing timeouts across issues, pull requests, repositories, Actions, packages, and Dependabot alerts.

2 months ago · search · scraping

