the production readiness review

the production readiness review (prr) is one of those rituals that sound bureaucratic and turn out to be the cheapest insurance you’ll ever buy. google’s sre book covers a more elaborate version under “early engagement” and “launch coordination engineering”; the checklist below is the one-hour version, sized for teams that don’t have a dedicated sre org.

the format

one hour, one document, one team’s launch on the line. the document is a checklist. the items are specific — not “is the system observable” but “what’s the dashboard url for the new endpoint, and what alerts fire on it.” the team that’s launching presents. the team that’s reviewing asks the questions on the checklist, in order.
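a minimal sketch of what “specific” can look like when the checklist is kept as data. everything here is illustrative: the `ChecklistItem` shape, the `NON_ANSWERS` set, and the example.com url are assumptions, not a standard format.

```python
from dataclasses import dataclass

# answers that count as "not answered" in this sketch (an assumption; tune to taste)
NON_ANSWERS = {"", "tbd", "todo", "n/a", "i don't know"}


@dataclass
class ChecklistItem:
    category: str     # e.g. "observability"
    question: str     # the specific question the reviewers ask
    answer: str = ""  # filled in by the launching team before the review


def unanswered(items):
    """Return the items whose answer is missing or a placeholder."""
    return [i for i in items if i.answer.strip().lower() in NON_ANSWERS]


checklist = [
    ChecklistItem(
        "observability",
        "what's the dashboard url for the new endpoint, and what alerts fire on it?",
        "https://dashboards.example.com/new-endpoint; p99 latency alert pages on-call",
    ),
    ChecklistItem("rollback", "how do you turn this off in under five minutes?", "TBD"),
]

# list what the review still has to dig into
for item in unanswered(checklist):
    print(f"[{item.category}] still open: {item.question}")
```

running a check like this before the meeting means the hour goes to the open items rather than to finding out which ones are open.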

the categories that have earned their place

observability. dashboard exists. logs are structured. the p99 latency alert is wired and tested. tracing covers the new code path.
failure modes. what happens if the upstream is down? rate-limited? slow but not failing? the answer can’t be “i don’t know.”
rollback. how do you turn this off in under five minutes? if it’s “we revert and redeploy”, that’s not a rollback. that’s a hope.
data. is there a migration? is it reversible? if not, what’s the path back?
load. what was tested? at what concurrency? what breaks at 10x?
security. for each new endpoint: authn, authz, rate-limiting, input validation. are there paths that bypass any of them?
on-call. who gets paged? what runbook do they follow? does the runbook reference the right dashboards?
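the failure-mode and rollback questions above often have the same answer in code: a kill switch plus a bounded call with a fallback. a rough sketch, assuming a hypothetical in-process `FLAGS` dict standing in for whatever flag or config system you actually run, and a hypothetical `legacy_path` that is still safe to serve:

```python
import concurrent.futures

# hypothetical flag store; a real one would be a flag service or config
# system that can be flipped without a deploy
FLAGS = {"new_endpoint_enabled": True}

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def legacy_path(request):
    """The old behaviour, kept alive until the launch has proven itself."""
    return {"source": "legacy", "request": request}


def handle(request, call_upstream, timeout_s=0.5):
    # rollback: flipping the flag turns the new path off, no deploy needed
    if not FLAGS["new_endpoint_enabled"]:
        return legacy_path(request)

    future = _pool.submit(call_upstream, request)
    try:
        # "slow but not failing" is bounded by the timeout
        return {"source": "new", "data": future.result(timeout=timeout_s)}
    except concurrent.futures.TimeoutError:
        return legacy_path(request)
    except Exception:
        # upstream down or rate-limited: degrade instead of failing the request
        return legacy_path(request)


def upstream_down(request):
    raise ConnectionError("upstream is down")


print(handle("req-1", lambda r: "fresh data")["source"])
print(handle("req-2", upstream_down)["source"])
```

one caveat with this shape: a call that times out keeps running in its worker thread, so a real version would also want a timeout on the upstream client itself.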

the prr is not a gate that stops launches. it’s a forcing function that surfaces the items the team didn’t think about. teams that go through prr a few times start writing the answers into the original design doc. the review gets shorter. that’s the point.

the failure mode of the prr: it becomes a rubber stamp. defenses: rotate the reviewers, don’t let anyone who built the thing review it, and keep the document specific to this launch (not a generic template the team filled in).
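the first two defenses are mechanical enough to script. a small sketch, assuming a hypothetical roster and a record of who reviewed recently; the function name and the names in the example are made up for illustration.

```python
def pick_reviewers(roster, builders, recent_reviewers, count=2):
    """Pick reviewers who did not build the launch, preferring people
    who have not reviewed recently."""
    eligible = [p for p in roster if p not in builders]
    if len(eligible) < count:
        raise ValueError("not enough independent reviewers on the roster")
    # stable sort: people who have not reviewed recently come first,
    # roster order is kept otherwise
    eligible.sort(key=lambda p: p in recent_reviewers)
    return eligible[:count]


roster = ["ana", "ben", "chen", "dee", "eli"]
reviewers = pick_reviewers(roster, builders={"ana", "ben"}, recent_reviewers={"chen"})
print(reviewers)  # → ['dee', 'eli']
```

the third defense, a document specific to this launch, still needs a human to read it.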

the cost: an hour. the alternative: an outage at 3am.