Three cuts that looked like one wound.
Anthropic’s April 23 postmortem is the kind of document that teaches more by its structure than by any individual fact it contains. Three separate engineering changes — a reasoning-effort downgrade on March 4, a caching bug deployed March 26, and a verbosity-reduction prompt on April 16 — each affected different traffic slices on different schedules. The aggregate signal looked like broad, inconsistent degradation. The root causes were independent. The diagnostic challenge was severe.
This is a textbook instance of what safety researchers call “Swiss cheese” failure: multiple independent layers each have a hole, and on a given day, the holes align. The lesson is not that each change was negligent; the lesson is that overlapping independent changes create combinatorial failure spaces that no single review process can reliably catch. The postmortem’s value is in naming this clearly enough to prevent the next instance.
Models the postmortem amplifies.
Independent holes, aligned by bad luck.
James Reason’s Swiss Cheese Model holds that complex-system failures occur when gaps in multiple independent defensive layers line up. Anthropic’s three bugs are a clean instance: the reasoning downgrade, the caching bug, and the verbosity prompt each passed its own independent safeguards, and each produced only a narrow slice of user impact. Combined, they produced what appeared to be systemic degradation.
The postmortem names the alignment problem directly: “Because each change affected a different slice of traffic on a different schedule, the aggregate effect looked like broad, inconsistent degradation.” The word “inconsistent” is the Swiss Cheese tell: users experienced different failure modes at different times because they were hitting different holes. No single user’s bug report described the system accurately.
Each change affected a different slice of traffic on a different schedule…
The caching bug that moved the constraint.
The caching optimization was a reasonable first-order improvement. The second-order effect was catastrophic: a bug caused the thinking-history clearing to fire on every subsequent turn rather than just once. Each turn stripped the model’s memory of why it had made prior choices. The second-order failure is the kind that passes code review precisely because the first-order intent is obviously correct.
Anthropic describes the compounding: “After a session crossed the idle threshold once, each request for the rest of that process told the API to keep only the most recent block of reasoning and discard everything before it. This compounded: if you sent a follow-up message while Claude was in the middle of a tool use, that started a new turn under the broken flag, so even the reasoning from the current turn was dropped.” Each turn made the next worse.
After a session crossed the idle threshold once, each request cleared more thinking…
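A minimal sketch of that failure mode as the postmortem describes it, not Anthropic’s actual code: the session object, the idle cutoff, and the clear_older_thinking flag name are all stand-ins. The bug is a sticky flag: once the idle threshold is crossed, nothing resets it, so every later request asks the API to discard all but the most recent reasoning block.

```python
from dataclasses import dataclass

IDLE_THRESHOLD_SECONDS = 30 * 60  # hypothetical idle cutoff

@dataclass
class Session:
    last_activity: float
    crossed_idle_threshold: bool = False  # the bug: this flag is sticky once set

def build_request_flags(session: Session, now: float) -> dict:
    """Decide whether this request should ask the API to drop older thinking blocks."""
    if now - session.last_activity > IDLE_THRESHOLD_SECONDS:
        session.crossed_idle_threshold = True

    # Buggy behavior: the flag is read but never cleared, so after one idle gap
    # every subsequent request keeps only the most recent reasoning block.
    clear_older_thinking = session.crossed_idle_threshold

    # The fix is one line: apply the clearing exactly once, then reset.
    # session.crossed_idle_threshold = False

    session.last_activity = now
    return {"clear_older_thinking": clear_older_thinking}
```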
The Opus 4.7 review that caught the bug retroactively.
Inversion asks: what would guarantee failure? Anthropic’s post-investigation move operationalizes this. They back-tested Opus 4.7’s Code Review against the offending PRs. Given complete multi-repository context, Opus 4.7 found the caching bug; Opus 4.6 did not. The lesson: multi-repository context is the necessary input for finding cross-system bugs, and reviews scoped to a single repository will miss them.
The finding is precise: “When provided the code repositories necessary to gather complete context, Opus 4.7 found the bug, while Opus 4.6 didn’t. To prevent this from happening again, we are now landing support for additional repositories as context for code reviews.” The fix is structural, not model-dependent: more context, not a smarter model.
Opus 4.7 found the bug when given multi-repo context; Opus 4.6 didn't.
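What “additional repositories as context” might look like in practice, sketched against the Anthropic Python SDK. The model identifier, file globbing, and prompt framing are assumptions for illustration, not Anthropic’s review pipeline.

```python
from pathlib import Path
import anthropic  # assumes the anthropic Python SDK is installed

def gather_repo_context(repo_paths: list[str], max_chars: int = 200_000) -> str:
    """Concatenate sources from every repository the change touches, so the
    reviewer sees cross-system callers, not just the diffed repo."""
    chunks = []
    for repo in repo_paths:
        for path in sorted(Path(repo).rglob("*.py")):
            chunks.append(f"# {path}\n{path.read_text(errors='ignore')}")
    return "\n\n".join(chunks)[:max_chars]

def review_pr(diff: str, repo_paths: list[str]) -> str:
    client = anthropic.Anthropic()
    prompt = (
        "Review this change for bugs that only appear when the systems below interact.\n\n"
        f"<repositories>\n{gather_repo_context(repo_paths)}\n</repositories>\n\n"
        f"<diff>\n{diff}\n</diff>"
    )
    response = client.messages.create(
        model="claude-opus-4-7",  # placeholder identifier, following the postmortem's naming
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```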
Test coverage that didn’t hold at the intersection.
Margin of safety in software is test coverage. The postmortem lists every defensive layer the caching bug passed: “multiple human and automated code reviews, as well as unit tests, end-to-end tests, automated verification, and dogfooding.” The bug cleared every layer. The implicit finding: corner-case bugs at system intersections require dedicated intersection tests, not just comprehensive unit or end-to-end coverage.
The failure mode is precise: the bug only manifested in stale sessions that had crossed the idle threshold — a corner case that none of the standard tests were scoped to cover. The margin existed; its scope missed the intersection. “It took us over a week to discover and confirm the root cause.”
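A sketch of the intersection test that was missing, reusing the hypothetical Session and build_request_flags names from the earlier sketch. It pins the invariant at the seam: a stale session clears older thinking on exactly one request, and a follow-up seconds later does not clear again.

```python
import time

def test_idle_threshold_clears_thinking_exactly_once():
    start = time.time()
    session = Session(last_activity=start)

    # The session goes idle past the threshold, then the user returns.
    after_idle = start + IDLE_THRESHOLD_SECONDS + 1
    first = build_request_flags(session, now=after_idle)
    assert first["clear_older_thinking"] is True

    # A follow-up seconds later, mid tool use, must NOT clear again.
    follow_up = build_request_flags(session, now=after_idle + 5)
    assert follow_up["clear_older_thinking"] is False  # fails against the buggy version
```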
Models that do not survive intact.
The medium-effort “safety” move that backfired.
Engineering folklore says: default to less capability and reduce the blast radius. Anthropic applied this to reasoning effort: high effort caused occasional UI freezes, so the default switched to medium. The backfire was immediate: users experienced the lowered effort as a quality regression, not a latency improvement. The trade-off that looked conservative was experienced as a failure. The folk wisdom held for latency risk; it failed for intelligence risk.
The gap between “slightly lower intelligence on benchmarks” and “noticeably worse in production” is the asymmetry. Anthropic shipped design iterations to surface the effort setting to users, but most users kept the medium default anyway. The reversal to high/xhigh defaults on April 7 was the data-driven encoding of the asymmetry into product policy.
"Slightly lower intelligence" on evals became "feels less intelligent" in production.
When the signal looks like noise.
Standard observability doctrine: instrument everything, alert on anomalies. Anthropic had all of this. The failure was epistemic: three bugs, each producing narrow slices of degradation in different traffic segments, looked like normal variation in aggregate. Observability without segment-aware analysis misses multi-source failures. The monitoring was correct; the inference from monitoring was not.
The postmortem’s timing: degradation reports began in early March; the caching bug wasn’t fixed until April 10; the verbosity prompt wasn’t reverted until April 20. Seven weeks from first reports to full resolution — not from negligence, but from the genuine difficulty of distinguishing three independent signal streams that each looked like noise in isolation.
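A toy illustration of why three narrow regressions read as noise in aggregate. The segment names and rates are invented; the point is only the arithmetic: a 5-to-7-point jump inside each small slice moves the fleet-wide number by less than a point.

```python
# Hypothetical fraction of degraded requests per traffic segment.
baseline = {"high_effort_users": 0.02, "stale_sessions": 0.02, "verbose_prompts": 0.02, "everyone_else": 0.02}
observed = {"high_effort_users": 0.09, "stale_sessions": 0.08, "verbose_prompts": 0.07, "everyone_else": 0.02}
weights  = {"high_effort_users": 0.06, "stale_sessions": 0.05, "verbose_prompts": 0.04, "everyone_else": 0.85}

agg_before = sum(baseline[s] * weights[s] for s in weights)
agg_after  = sum(observed[s] * weights[s] for s in weights)

print(f"aggregate shift: {agg_after - agg_before:+.2%}")   # about +0.9pp: easy to call noise
for s in baseline:
    print(f"{s:>18}: {observed[s] - baseline[s]:+.0%}")     # +5pp to +7pp in each affected slice
```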
Independent small changes can have dependent large effects.
Each of the three changes was small. Each was reasonable. Each shipped on its own review cycle. But they operated on overlapping systems (Claude Code’s context management, the Anthropic API, extended thinking). Their combined effect was a failure space no single reviewer could anticipate. Small-and-independent shipping is safe when changes are genuinely independent. At system intersections, it creates gap risk.
“This bug was at the intersection of Claude Code’s context management, the Anthropic API, and extended thinking.” Three teams, three systems, three shipping cycles. The intersection failure space was nobody’s explicit responsibility. “Ship small” works when small is also isolated.
Models worth adding to the latticework.
The Multi-Slice Phantom.
A failure mode in which multiple independent bugs, each affecting a different user segment, produce an aggregate signal resembling a single system-wide regression — but reproducible by no single user’s scenario. Standard debugging (find one root cause) systematically fails against phantoms. Detection requires segment-aware monitoring plus independent hypothesis testing for each sub-population.
The phantom’s signature is inconsistency: users compare notes, no two have the same failure. Engineers reproduce in some environments but not others. “Because each change affected a different slice of traffic on a different schedule, the aggregate effect looked like broad, inconsistent degradation.” Inconsistency, treated as noise, is the phantom hiding in plain sight.
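A sketch of the segment-aware check the phantom demands: test each sub-population against its own baseline instead of testing the aggregate. The two-proportion z-test here is a generic choice for illustration, not the postmortem’s method.

```python
from math import sqrt

def flag_regressed_segments(baseline: dict[str, tuple[int, int]],
                            current: dict[str, tuple[int, int]],
                            z_threshold: float = 3.0) -> list[str]:
    """Inputs map segment name -> (degraded_requests, total_requests).
    Each segment is tested independently; the aggregate is never consulted."""
    flagged = []
    for segment, (bad_then, n_then) in baseline.items():
        bad_now, n_now = current[segment]
        p_then, p_now = bad_then / n_then, bad_now / n_now
        pooled = (bad_then + bad_now) / (n_then + n_now)
        se = sqrt(pooled * (1 - pooled) * (1 / n_then + 1 / n_now))
        if se > 0 and (p_now - p_then) / se > z_threshold:
            flagged.append(segment)
    return flagged
```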
Intelligence Asymmetry.
Users perceive intelligence decrements more acutely than intelligence increments of equal magnitude. A model that becomes noticeably smarter is “better”; a model that becomes slightly less smart is “broken.” Product decisions about capability trade-offs cannot be evaluated by symmetric loss functions. Implication: default to high capability and offer explicit opt-down. Never silent opt-down.
The reasoning-effort reversal is the cleanest data point: “slightly lower intelligence with significantly less latency” on Anthropic’s own evals. Users experienced it as Claude Code “felt less intelligent” — a qualitative regression. The asymmetry gap between benchmark delta and perceived quality shift is the model’s core claim.
"Slightly lower intelligence" on evals became "feels less intelligent" in production.
Intersection Testing.
A testing discipline targeting the boundaries between systems — not individual system correctness but cross-system behavior when all systems are active simultaneously. The caching bug required intersection testing: Claude Code’s idle-session detection × the Anthropic API’s clear_thinking header × extended thinking state management. Each subsystem tested correctly in isolation; the intersection was untested.
“This bug was at the intersection of Claude Code’s context management, the Anthropic API, and extended thinking. The changes it introduced made it past multiple human and automated code reviews, as well as unit tests, end-to-end tests, automated verification, and dogfooding.” All standard tests. The gap was the intersection — a space none of them were scoped to cover.
Camouflage by Experiment.
A failure pattern in which an active A/B test inadvertently masks a bug by suppressing its symptoms in the experimental population — making the bug unreproducible in internal builds even after external user reports. Mitigation: for any bug report that cannot be internally reproduced, cross-reference all active experiments against the affected traffic segment before concluding the report is noise.
“Two unrelated experiments made it challenging for us to reproduce the issue at first: an internal-only server-side experiment related to message queuing; and an orthogonal change in how we display thinking suppressed this bug in most CLI sessions, so we didn’t catch it even when testing external builds.” The bug was real; the experiments hid it from the exact population running internal tests.
Two unrelated experiments masked the caching bug in internal testing environments.
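A sketch of the mitigation: before concluding an unreproducible report is noise, enumerate every active experiment whose enrolled population overlaps the reporting segment. The experiment names echo the postmortem’s description; the predicate mechanism is an assumption.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Experiment:
    name: str
    is_enrolled: Callable[[dict], bool]  # predicate over a traffic-segment descriptor

def overlapping_experiments(report_segment: dict, active: list[Experiment]) -> list[str]:
    """List active experiments that could be suppressing the symptom for this segment."""
    return [exp.name for exp in active if exp.is_enrolled(report_segment)]

# Usage sketch with hypothetical experiments modeled on the postmortem's description:
active = [
    Experiment("internal_message_queuing", lambda seg: seg.get("internal_build", False)),
    Experiment("thinking_display_change", lambda seg: seg.get("surface") == "cli"),
]
print(overlapping_experiments({"surface": "cli", "internal_build": True}, active))
# -> ['internal_message_queuing', 'thinking_display_change']
```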
When to reach for which.
Reach for the Multi-Slice Phantom when reports of a regression are inconsistent and no single scenario reproduces it. Reach for Intelligence Asymmetry when a product decision trades capability for latency or cost. Reach for Intersection Testing when a change touches the boundary between systems owned by different teams. Reach for Camouflage by Experiment when a credible external report cannot be reproduced internally.
The latticework, after the fix.
Postmortems are one of the highest-density knowledge-transfer documents that engineering organizations produce. This one earns its place in the latticework not because Anthropic made unusual mistakes — they made the ordinary mistakes of a high-velocity organization shipping at system intersections — but because they named those mistakes with unusual precision.
Because each change affected a different slice of traffic on a different schedule, the aggregate effect looked like broad, inconsistent degradation. — Anthropic Engineering · April 2025 Postmortem
Four concepts to carry forward: the Multi-Slice Phantom; Intelligence Asymmetry; Intersection Testing; and Camouflage by Experiment. Add them. You will use them.