Three cuts that looked like one wound.
Anthropic’s April 23 postmortem is the kind of document that teaches more by its structure than by any individual fact it contains. Three separate engineering changes — a reasoning-effort downgrade on March 4, a caching bug deployed March 26, and a verbosity-reduction prompt on April 16 — each affected different traffic slices on different schedules. The aggregate signal looked like broad, inconsistent degradation. The root causes were independent. The diagnostic challenge was severe.
This is a textbook instance of what safety researchers call “Swiss cheese” failure: multiple independent layers each have a hole, and on a given day, the holes align. The lesson is not that each change was negligent; the lesson is that overlapping independent changes create combinatorial failure spaces that no single review process can reliably catch. The postmortem’s value is in naming this clearly enough to prevent the next instance.
Models the postmortem amplifies.
Independent holes, aligned by bad luck.
James Reason’s Swiss Cheese Model holds that complex-system failures occur when gaps in multiple independent defensive layers line up. Anthropic’s three bugs are a clean instance: the reasoning downgrade, the caching bug, and the verbosity prompt each passed its own independent safeguards, and each produced only a narrow slice of user impact. Combined, they produced what appeared to be systemic degradation.
The postmortem names the alignment problem directly: “Because each change affected a different slice of traffic on a different schedule, the aggregate effect looked like broad, inconsistent degradation.” The word “inconsistent” is the Swiss Cheese tell: users experienced different failure modes at different times because they were hitting different holes. No single user’s bug report described the system accurately.
Each change affected a different slice of traffic on a different schedule…
The caching bug that moved the constraint.
The caching optimization was a reasonable first-order improvement. The second-order effect was catastrophic: a bug caused the thinking-history clearing to fire on every subsequent turn rather than just once. Each turn stripped the model’s memory of why it had made prior choices. The second-order failure is the kind that passes code review precisely because the first-order intent is obviously correct.
Anthropic describes the compounding: “After a session crossed the idle threshold once, each request for the rest of that process told the API to keep only the most recent block of reasoning and discard everything before it. This compounded: if you sent a follow-up message while Claude was in the middle of a tool use, that started a new turn under the broken flag, so even the reasoning from the current turn was dropped.” Each turn made the next worse.
After a session crossed the idle threshold once, each request cleared more thinking…
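A minimal sketch of that failure mode as the postmortem describes it, not Anthropic’s actual code: the session object, the idle cutoff, and the clear_older_thinking flag name are all stand-ins. The bug is a sticky flag: once the idle threshold is crossed, nothing resets it, so every later request asks the API to discard all but the most recent reasoning block.

```python
from dataclasses import dataclass

IDLE_THRESHOLD_SECONDS = 30 * 60  # hypothetical idle cutoff

@dataclass
class Session:
    last_activity: float
    crossed_idle_threshold: bool = False  # the bug: this flag is sticky once set

def build_request_flags(session: Session, now: float) -> dict:
    """Decide whether this request should ask the API to drop older thinking blocks."""
    if now - session.last_activity > IDLE_THRESHOLD_SECONDS:
        session.crossed_idle_threshold = True

    # Buggy behavior: the flag is read but never cleared, so after one idle gap
    # every subsequent request keeps only the most recent reasoning block.
    clear_older_thinking = session.crossed_idle_threshold

    # The fix is one line: apply the clearing exactly once, then reset.
    # session.crossed_idle_threshold = False

    session.last_activity = now
    return {"clear_older_thinking": clear_older_thinking}
```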
The Opus 4.7 review that caught the bug retroactively.
Inversion asks: what would guarantee failure? Anthropic’s post-investigation move operationalizes this. They back-tested Opus 4.7’s Code Review against the offending PRs. Given complete multi-repository context, Opus 4.7 found the caching bug; Opus 4.6 did not. The lesson: multi-repository context is the necessary input for finding cross-system bugs, and reviews scoped to a single repository will miss them.
The finding is precise: “When provided the code repositories necessary to gather complete context, Opus 4.7 found the bug, while Opus 4.6 didn’t. To prevent this from happening again, we are now landing support for additional repositories as context for code reviews.” The fix is structural, not model-dependent: more context, not a smarter model.
Opus 4.7 found the bug when given multi-repo context; Opus 4.6 didn't.
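What “additional repositories as context” might look like in practice, sketched against the Anthropic Python SDK. The model identifier, file globbing, and prompt framing are assumptions for illustration, not Anthropic’s review pipeline.

```python
from pathlib import Path
import anthropic  # assumes the anthropic Python SDK is installed

def gather_repo_context(repo_paths: list[str], max_chars: int = 200_000) -> str:
    """Concatenate sources from every repository the change touches, so the
    reviewer sees cross-system callers, not just the diffed repo."""
    chunks = []
    for repo in repo_paths:
        for path in sorted(Path(repo).rglob("*.py")):
            chunks.append(f"# {path}\n{path.read_text(errors='ignore')}")
    return "\n\n".join(chunks)[:max_chars]

def review_pr(diff: str, repo_paths: list[str]) -> str:
    client = anthropic.Anthropic()
    prompt = (
        "Review this change for bugs that only appear when the systems below interact.\n\n"
        f"<repositories>\n{gather_repo_context(repo_paths)}\n</repositories>\n\n"
        f"<diff>\n{diff}\n</diff>"
    )
    response = client.messages.create(
        model="claude-opus-4-7",  # placeholder identifier, following the postmortem's naming
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```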
Test coverage that didn’t hold at the intersection.
Margin of safety in software is test coverage. The postmortem lists every defensive layer the caching bug passed: “multiple human and automated code reviews, as well as unit tests, end-to-end tests, automated verification, and dogfooding.” The bug cleared every layer. The implicit finding: corner-case bugs at system intersections require dedicated intersection tests, not just comprehensive unit or end-to-end coverage.
The failure mode is precise: the bug only manifested in stale sessions that had crossed the idle threshold — a corner case that none of the standard tests were scoped to cover. The margin existed; its scope missed the intersection. “It took us over a week to discover and confirm the root cause.”
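A sketch of the intersection test that was missing, reusing the hypothetical Session and build_request_flags names from the earlier sketch. It pins the invariant at the seam: a stale session clears older thinking on exactly one request, and a follow-up seconds later does not clear again.

```python
import time

def test_idle_threshold_clears_thinking_exactly_once():
    start = time.time()
    session = Session(last_activity=start)

    # The session goes idle past the threshold, then the user returns.
    after_idle = start + IDLE_THRESHOLD_SECONDS + 1
    first = build_request_flags(session, now=after_idle)
    assert first["clear_older_thinking"] is True

    # A follow-up seconds later, mid tool use, must NOT clear again.
    follow_up = build_request_flags(session, now=after_idle + 5)
    assert follow_up["clear_older_thinking"] is False  # fails against the buggy version
```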
Models that do not survive intact.
The medium-effort “safety” move that backfired.
Engineering folklore says: default to less capability and reduce the blast radius. Anthropic applied this to reasoning effort: high effort caused occasional UI freezes, so the default switched to medium. The backfire was immediate: users experienced the lowered effort as a quality regression, not a latency improvement. The trade-off that looked conservative was experienced as a failure. The folk wisdom held for latency risk; it failed for intelligence risk.
The gap between “slightly lower intelligence on benchmarks” and “noticeably worse in production” is the asymmetry. Anthropic shipped design iterations to surface the effort setting to users, but most users kept the medium default anyway. The reversal to high/xhigh defaults on April 7 was the data-driven encoding of the asymmetry into product policy.
"Slightly lower intelligence" on evals became "feels less intelligent" in production.
When the signal looks like noise.
Standard observability doctrine: instrument everything, alert on anomalies. Anthropic had all of this. The failure was epistemic: three bugs, each producing narrow slices of degradation in different traffic segments, looked like normal variation in aggregate. Observability without segment-aware analysis misses multi-source failures. The monitoring was correct; the inference from monitoring was not.
The postmortem’s timing: degradation reports began in early March; the caching bug wasn’t fixed until April 10; the verbosity prompt wasn’t reverted until April 20. Seven weeks from first reports to full resolution — not from negligence, but from the genuine difficulty of distinguishing three independent signal streams that each looked like noise in isolation.
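A toy illustration of why three narrow regressions read as noise in aggregate. The segment names and rates are invented; the point is only the arithmetic: a 5-to-7-point jump inside each small slice moves the fleet-wide number by less than a point.

```python
# Hypothetical fraction of degraded requests per traffic segment.
baseline = {"high_effort_users": 0.02, "stale_sessions": 0.02, "verbose_prompts": 0.02, "everyone_else": 0.02}
observed = {"high_effort_users": 0.09, "stale_sessions": 0.08, "verbose_prompts": 0.07, "everyone_else": 0.02}
weights  = {"high_effort_users": 0.06, "stale_sessions": 0.05, "verbose_prompts": 0.04, "everyone_else": 0.85}

agg_before = sum(baseline[s] * weights[s] for s in weights)
agg_after  = sum(observed[s] * weights[s] for s in weights)

print(f"aggregate shift: {agg_after - agg_before:+.2%}")   # about +0.9pp: easy to call noise
for s in baseline:
    print(f"{s:>18}: {observed[s] - baseline[s]:+.0%}")     # +5pp to +7pp in each affected slice
```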
Independent small changes can have dependent large effects.
Each of the three changes was small. Each was reasonable. Each shipped on its own review cycle. But they operated on overlapping systems (Claude Code’s context management, the Anthropic API, extended thinking). Their combined effect was a failure space no single reviewer could anticipate. Small-and-independent shipping is safe when changes are genuinely independent. At system intersections, it creates gap risk.
“This bug was at the intersection of Claude Code’s context management, the Anthropic API, and extended thinking.” Three teams, three systems, three shipping cycles. The intersection failure space was nobody’s explicit responsibility. “Ship small” works when small is also isolated.
Models worth adding to the latticework.
The Multi-Slice Phantom.
A failure mode in which multiple independent bugs, each affecting a different user segment, produce an aggregate signal resembling a single system-wide regression — but reproducible by no single user’s scenario. Standard debugging (find one root cause) systematically fails against phantoms. Detection requires segment-aware monitoring plus independent hypothesis testing for each sub-population.
The phantom’s signature is inconsistency: users compare notes, no two have the same failure. Engineers reproduce in some environments but not others. “Because each change affected a different slice of traffic on a different schedule, the aggregate effect looked like broad, inconsistent degradation.” Inconsistency, treated as noise, is the phantom hiding in plain sight.
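A sketch of the segment-aware check the phantom demands: test each sub-population against its own baseline instead of testing the aggregate. The two-proportion z-test here is a generic choice for illustration, not the postmortem’s method.

```python
from math import sqrt

def flag_regressed_segments(baseline: dict[str, tuple[int, int]],
                            current: dict[str, tuple[int, int]],
                            z_threshold: float = 3.0) -> list[str]:
    """Inputs map segment name -> (degraded_requests, total_requests).
    Each segment is tested independently; the aggregate is never consulted."""
    flagged = []
    for segment, (bad_then, n_then) in baseline.items():
        bad_now, n_now = current[segment]
        p_then, p_now = bad_then / n_then, bad_now / n_now
        pooled = (bad_then + bad_now) / (n_then + n_now)
        se = sqrt(pooled * (1 - pooled) * (1 / n_then + 1 / n_now))
        if se > 0 and (p_now - p_then) / se > z_threshold:
            flagged.append(segment)
    return flagged
```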
Intelligence Asymmetry.
Users perceive intelligence decrements more acutely than intelligence increments of equal magnitude. A model that becomes noticeably smarter is “better”; a model that becomes slightly less smart is “broken.” Product decisions about capability trade-offs cannot be evaluated by symmetric loss functions. Implication: default to high capability and offer explicit opt-down. Never silent opt-down.
The reasoning-effort reversal is the cleanest data point: “slightly lower intelligence with significantly less latency” on Anthropic’s own evals. Users experienced it as Claude Code “felt less intelligent” — a qualitative regression. The asymmetry gap between benchmark delta and perceived quality shift is the model’s core claim.
"Slightly lower intelligence" on evals became "feels less intelligent" in production.
Intersection Testing.
A testing discipline targeting the boundaries between systems — not individual system correctness but cross-system behavior when all systems are active simultaneously. The caching bug required intersection testing: Claude Code’s idle-session detection × the Anthropic API’s clear_thinking header × extended thinking state management. Each subsystem tested correctly in isolation; the intersection was untested.
“This bug was at the intersection of Claude Code’s context management, the Anthropic API, and extended thinking. The changes it introduced made it past multiple human and automated code reviews, as well as unit tests, end-to-end tests, automated verification, and dogfooding.” All standard tests. The gap was the intersection — a space none of them were scoped to cover.
Camouflage by Experiment.
A failure pattern in which an active A/B test inadvertently masks a bug by suppressing its symptoms in the experimental population — making the bug unreproducible in internal builds even after external user reports. Mitigation: for any bug report that cannot be internally reproduced, cross-reference all active experiments against the affected traffic segment before concluding the report is noise.
“Two unrelated experiments made it challenging for us to reproduce the issue at first: an internal-only server-side experiment related to message queuing; and an orthogonal change in how we display thinking suppressed this bug in most CLI sessions, so we didn’t catch it even when testing external builds.” The bug was real; the experiments hid it from the exact population running internal tests.
Two unrelated experiments masked the caching bug in internal testing environments.
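A sketch of the mitigation: before concluding an unreproducible report is noise, enumerate every active experiment whose enrolled population overlaps the reporting segment. The experiment names echo the postmortem’s description; the predicate mechanism is an assumption.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Experiment:
    name: str
    is_enrolled: Callable[[dict], bool]  # predicate over a traffic-segment descriptor

def overlapping_experiments(report_segment: dict, active: list[Experiment]) -> list[str]:
    """List active experiments that could be suppressing the symptom for this segment."""
    return [exp.name for exp in active if exp.is_enrolled(report_segment)]

# Usage sketch with hypothetical experiments modeled on the postmortem's description:
active = [
    Experiment("internal_message_queuing", lambda seg: seg.get("internal_build", False)),
    Experiment("thinking_display_change", lambda seg: seg.get("surface") == "cli"),
]
print(overlapping_experiments({"surface": "cli", "internal_build": True}, active))
# -> ['internal_message_queuing', 'thinking_display_change']
```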
When to reach for which.
Reach for the Multi-Slice Phantom when reports of a regression are inconsistent and no single scenario reproduces it. Reach for Intelligence Asymmetry when a product decision trades capability for latency or cost. Reach for Intersection Testing when a change touches the boundary between systems owned by different teams. Reach for Camouflage by Experiment when a credible external report cannot be reproduced internally.
The latticework, after the fix.
Postmortems are one of the highest-density knowledge-transfer documents that engineering organizations produce. This one earns its place in the latticework not because Anthropic made unusual mistakes — they made the ordinary mistakes of a high-velocity organization shipping at system intersections — but because they named those mistakes with unusual precision.
Because each change affected a different slice of traffic on a different schedule, the aggregate effect looked like broad, inconsistent degradation. — Anthropic Engineering · April 2025 Postmortem
Four concepts to carry forward: the Multi-Slice Phantom; Intelligence Asymmetry; Intersection Testing; and Camouflage by Experiment. Add them. You will use them.