incidents · postmortems · lessons

Real incidents.
Real systems.
Better engineers.

A library of structured engineering case studies, drawn from public postmortems. Read what broke, understand why, and steal the lesson.

weekly newsletter

One incident per week. Free, unsubscribe anytime.

~/failure-modes/latest
SHA 4f2a91c
$ fm read FM-004
title BGP withdrawal isolates a backbone
org Facebook
when 2021-10-04 · 6h 00m · SEV-1
tags networking, bgp, dns
A maintenance command on the backbone produced an
unintended result: routes advertising the company's
DNS were withdrawn, removing the org from the global
routing table.
$ fm lessons
// library  15 incidents to explore. New ones every week.
id
incident
org
when
severity
tags
FM-001
A WAF rule pegs every Cloudflare CPU at once
A new managed WAF rule contained a regex that backtracked exponentially on live HTTP traffic, spiking CPU to nearly 100% across every edge server worldwide within seconds of deployment.
Cloudflare
2019-07-02
SEV-1
networking, deploy, waf
FM-002
A 43-second partition splits GitHub's database for a day
A 43-second network partition between GitHub's East and West Coast sites tripped automatic failover. By the time the partition healed, both coasts had taken writes, and reconciling the split took most of a day.
GitHub
2018-10-21
SEV-1
database, replication, failover
FM-003
The four-hour S3 typo
A maintenance command with the wrong scope argument removed too much S3 subsystem capacity in us-east-1, forcing the index and placement subsystems through full restarts.
Amazon Web Services
2017-02-28
SEV-1
storage, tooling, blast-radius
FM-004
Facebook withdraws its own DNS from the internet
A backbone command issued to assess global capacity unintentionally took down all of Facebook's backbone. The audit tool that was supposed to block such a command had a bug, and Facebook's DNS servers, no longer able to reach the data centers, withdrew their own BGP route advertisements in response.
Facebook
2021-10-04
SEV-1
networking, bgp, dns
FM-005
A latent CDN bug, woken by a valid config change
A software release shipped 27 days earlier left a latent bug in Fastly's edge platform. A routine, valid customer configuration change triggered it, and 85% of Fastly's network began returning errors within seconds.
Fastly
2021-06-08
SEV-1
cdn, deploy, config
FM-006
Accidental rm -rf deletes GitLab's production database
A sysadmin accidentally deleted GitLab.com's production PostgreSQL database. The normal backups were broken or unsuitable, so GitLab restored from a six-hour-old LVM snapshot.
GitLab
2017-01-31
SEV-1
database, backup, operator-error
FM-007
A maintenance script deletes 883 customer sites
A maintenance script meant to deactivate a deprecated standalone app instead permanently deleted full customer sites. 775 customers lost access to their Jira and Confluence data, and bringing them back took up to two weeks.
Atlassian
2022-04-05
SEV-1
cloud, database, operator-error
FM-008
Cloudflare's control plane loses its primary facility
A cascading power failure took out Cloudflare's primary control plane facility. The high-availability cluster did not survive the loss of one of its three sites, and the dashboard, API, and analytics went down while the data plane kept serving customer traffic.
Cloudflare
2023-11-02
SEV-1
datacenter, ha, control-plane
FM-009
A telemetry rollout takes down ChatGPT for four hours
A new telemetry service deployed across OpenAI's Kubernetes clusters generated API operations whose cost scaled with cluster size. The control plane saturated, DNS-based service discovery broke, and the same overload kept the team from rolling the change back.
OpenAI
2024-12-11
SEV-1
deploy, kubernetes, cascade
FM-010
Slack's first day back: a Transit Gateway runs out of room
On the first Monday after the holiday break, an AWS Transit Gateway saturated under Slack's return-to-work traffic. Packet loss hit the web tier just as autoscaling tried to add 1,200 instances, and the provisioning service collapsed under its own quota and file-descriptor limits.
Slack
2021-01-04
SEV-2
cloud, scaling, cascade