What the engineering post is really about.
Prithvi Rajasekaran's post is nominally about harness design for long-running AI agents. But the deeper subject is how to build systems that improve — systems where feedback is explicit, criteria are concrete, and the scaffolding itself is treated as a living document of assumptions about model capability. That is a more general problem than agentic coding. It is the problem of building anything that learns.
The core architectural insight is borrowed from Generative Adversarial Networks: separate the entity that produces outputs from the entity that judges them, tune them against each other, and watch quality rise. Applied to agentic coding, this becomes planner → generator → evaluator, with each playing a structurally distinct role. The result is not merely better code — it is a system that produces better code consistently, because the feedback loop is explicit rather than implicit.
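A minimal sketch of that loop, with a placeholder `call_model` standing in for whatever LLM client the harness actually uses; the prompts, role names, and iteration budget below are illustrative assumptions, not details from the post:

```python
def call_model(role: str, prompt: str) -> str:
    """Stand-in for a real LLM client call; wire up any model API here."""
    raise NotImplementedError

def run_task(task: str, max_iterations: int = 3) -> str:
    # Planner decomposes; generator produces; evaluator judges.
    plan = call_model("planner", f"Break this task into concrete steps:\n{task}")
    output = call_model("generator", f"Plan:\n{plan}\n\nImplement the plan.")
    for _ in range(max_iterations):
        # The critic is a separate role: it never grades its own work.
        critique = call_model(
            "evaluator",
            f"Task:\n{task}\n\nOutput:\n{output}\n\n"
            "List specific, actionable defects. Reply DONE if none remain.",
        )
        if critique.strip() == "DONE":
            break
        # Explicit feedback flows into the next generation pass.
        output = call_model(
            "generator",
            f"Previous output:\n{output}\n\nCritique:\n{critique}\n\nRevise.",
        )
    return output
```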
But the post's most important meta-lesson is about the harness itself: every component encodes an assumption about what the model cannot do alone. As models improve, those assumptions go stale. The correct response is systematic: strip away what is no longer load-bearing, and rebuild around what the frontier now requires.
Models the post amplifies.
Separate producer from critic to get honest feedback.
Inversion asks: what would guarantee the outcome you don't want? In evaluation, the guaranteed path to self-serving feedback is asking the producer to evaluate their own work. Rajasekaran documents this failure mode precisely: agents asked to self-evaluate "tend to respond by confidently praising the work — even when, to a human observer, the quality is obviously mediocre." The fix is structural inversion: separate the producing agent from the judging agent, then tune the judge to be skeptical. The inversion works because tuning a standalone evaluator to be critical is far more tractable than making a generator self-critical.
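One way to see why the standalone judge is more tractable to tune: its instructions live apart from the generator's and can be hardened freely. A sketch under assumed prompts (none of this wording is from the post):

```python
def call_model(system: str, prompt: str) -> str:
    raise NotImplementedError  # stand-in for a real LLM call

# Producer and critic get structurally different instructions. Because the
# judge is standalone, its prompt can be hardened for skepticism without
# touching (or degrading) the generator at all.
GENERATOR_SYSTEM = "You are a senior engineer. Produce the best implementation you can."

JUDGE_SYSTEM = (
    "You are a skeptical reviewer with no stake in this work. Assume defects "
    "exist until proven otherwise. Unqualified praise is a failed review: "
    "every verdict must cite concrete evidence from the artifact."
)

def review(artifact: str, criteria: list[str]) -> dict[str, str]:
    """Grade each criterion separately so praise can't hide behind vagueness."""
    return {
        criterion: call_model(
            JUDGE_SYSTEM,
            f"Criterion: {criterion}\n\nArtifact:\n{artifact}\n\n"
            "Verdict (pass/fail) with cited evidence:",
        )
        for criterion in criteria
    }
```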
Every harness component is a falsifiable claim.
First-principles thinking requires identifying the load-bearing assumptions beneath a system and testing them explicitly. Rajasekaran applies this to his own harness: "every component in a harness encodes an assumption about what the model can't do on its own, and those assumptions are worth stress testing, both because they may be incorrect, and because they can quickly go stale as models improve." This converts harness maintenance from craft intuition into a scientific practice: state the assumption, run the experiment, discard what doesn't hold.
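Taken literally, each component can carry its assumption and an ablation experiment that tests it. The `HarnessComponent` shape below is my construction, not the post's:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class HarnessComponent:
    name: str
    assumption: str  # the falsifiable claim this component encodes
    # Ablation experiment: run a fixed benchmark with and without the
    # component, returning (score_with, score_without) for a given model.
    ablation: Callable[[str], tuple[float, float]]

def stress_test(component: HarnessComponent, model: str, margin: float = 0.0) -> bool:
    """The assumption holds only if removing the component still hurts."""
    score_with, score_without = component.ablation(model)
    load_bearing = score_with > score_without + margin
    if not load_bearing:
        print(f"{component.name}: assumption stale for {model}; candidate for removal")
    return load_bearing

# Example entry; the ablation would be wired to a real eval harness.
sprint_construct = HarnessComponent(
    name="sprint construct",
    assumption="the model cannot keep a long build coherent in one pass",
    ablation=lambda model: (0.0, 0.0),
)
```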
Find the simplest solution that still works.
The post cites Anthropic's own "Building Effective Agents" guidance: "find the simplest solution possible, and only increase complexity when needed." Rajasekaran demonstrates this concretely: the sprint construct that was essential for Sonnet 4.5 (which suffered context anxiety) became unnecessary overhead for Opus 4.6 (which handled coherence natively). Removing it without degrading performance is a real simplification win — and it was only visible because the team tested the assumption explicitly rather than treating the harness as sacred.
Explicit feedback beats implicit feedback every time.
The generator-evaluator loop is a feedback system with an unusual property: the feedback is explicit, structured, and targeted. Each sprint, the evaluator produces specific, actionable critique. The generator responds to it in the next iteration. Compared to a solo agent producing vague self-assessments, the explicit feedback loop delivers far more directed improvement per iteration. The Playwright MCP is the key enabler — it lets the evaluator interact with the live application rather than scoring static screenshots, which means the feedback is grounded in actual behavior, not surface appearance.
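What "explicit, structured, and targeted" might look like in practice: the critique arrives as typed findings the generator can work through one by one. The field names here are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    criterion: str  # which agreed-upon criterion the finding falls under
    location: str   # where in the running app the problem was observed
    observed: str   # what the evaluator actually saw while interacting live
    expected: str   # what the criterion says should have happened
    severity: str   # "blocker" | "major" | "minor"

@dataclass
class SprintCritique:
    findings: list[Finding]

    def blockers(self) -> list[Finding]:
        return [f for f in self.findings if f.severity == "blocker"]

# A self-assessment like "looks great!" gives the generator nothing to act on;
# a list of Finding records is a work queue for the next iteration.
```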
Sprint contracts as pre-committed acceptance criteria.
Before each sprint, generator and evaluator negotiate a contract — agreeing on what "done" looks like before any code is written. This is margin of safety applied to delivery: the contract pre-commits both parties to specific success criteria, preventing the evaluator from shifting standards after the fact and the generator from claiming completion on under-delivered work. The cost of the negotiation round is small; the benefit is that the entire sprint is anchored to testable criteria rather than vague goals.
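A contract in this sense is just pre-committed, gradable criteria, agreed before generation and frozen for the sprint. A hedged sketch, with names of my own choosing:

```python
from dataclasses import dataclass

def call_model(role: str, prompt: str) -> str:
    raise NotImplementedError  # stand-in for a real LLM call

@dataclass(frozen=True)  # frozen: neither party can move the goalposts mid-sprint
class SprintContract:
    goal: str
    acceptance_criteria: tuple[str, ...]  # each concrete enough to grade pass/fail

def negotiate_contract(task: str) -> SprintContract:
    """Generator proposes criteria; evaluator tightens them; both commit."""
    proposed = call_model("generator", f"Propose pass/fail criteria for: {task}")
    tightened = call_model("evaluator", f"Make these criteria strict and testable:\n{proposed}")
    return SprintContract(goal=task, acceptance_criteria=tuple(tightened.splitlines()))

def sprint_done(contract: SprintContract, verdicts: dict[str, bool]) -> bool:
    # "Done" is claimed against the pre-committed criteria and nothing else.
    return all(verdicts.get(c, False) for c in contract.acceptance_criteria)
```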
Models that don't survive intact.
Simpler harness, better output — when the model improves.
The intuition that more scaffolding produces better results is falsified by the post's iteration cycle. The first harness (sprint construct + context resets + per-sprint evaluation) was the most complex. The second harness (no sprint construct, single-pass evaluation, automatic compaction) was simpler — and performed as well or better with Opus 4.6. The correct principle is not "more scaffolding is better" but "scaffold exactly what the model cannot do on its own, and re-examine this whenever the model improves."
LLMs are constitutionally poor self-critics.
Self-assessment is a legitimate feedback mechanism for humans — we are capable of identifying our own errors with effort. For LLMs, the post documents a systematic failure: self-assessment skews positive regardless of actual quality, for both subjective tasks (design) and objective ones (functional correctness). The implication is that self-assessment should not be a component of any LLM system where output quality matters. Structural separation of producer and critic is not an enhancement — it is a prerequisite for honest evaluation.
The harness frontier moves with the model frontier.
The conventional view of system architecture is that the right design is found once and maintained. The post argues the opposite for AI harnesses: the optimal harness is a moving target, tightly coupled to the current model's capability boundary. What was load-bearing for Sonnet 4.5 (sprint decomposition, context resets) became unnecessary overhead for Opus 4.6. This means AI harnesses require active maintenance — not just bug fixes, but periodic architectural re-examination as model capabilities advance.
Models worth adding to the latticework.
Context Anxiety.
A specific failure mode observed in long-context LLM tasks: models begin "wrapping up work prematurely as they approach what they believe is their context limit." This is distinct from actual context exhaustion — it is anticipatory truncation, a form of premature convergence. The cure (context resets, giving the model a clean slate) differs from the cure for actual context exhaustion (compaction). Diagnosing which failure mode is active is a prerequisite for applying the right intervention. A generalizable principle: for any system that runs for a long time, ask whether the agent's behavior changes as it approaches its perceived limits.
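The triage is mechanical once stated: check whether the context is actually near its limit, or whether the model merely behaves as if it is. A sketch, with an illustrative threshold:

```python
def choose_intervention(tokens_used: int, context_limit: int,
                        wrapping_up_early: bool) -> str:
    """Match the cure to the failure mode that is actually present."""
    nearly_full = tokens_used > 0.8 * context_limit  # threshold is illustrative
    if nearly_full:
        # Genuine exhaustion: compact (summarize) and continue.
        return "compaction"
    if wrapping_up_early:
        # Plenty of room left, but the model *believes* it is out of space:
        # anticipatory truncation. Hand it a clean slate instead.
        return "context_reset"
    return "none"
```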
Operationalizing Subjectivity.
The design evaluation problem — "is this good?" — is typically treated as unanswerable by algorithm. The post shows it is tractable if you decompose it into criteria concrete enough to grade: design quality, originality, craft, functionality. The key move is converting a holistic judgment into a set of specific, gradable questions — not eliminating subjectivity, but encoding it in a form a model can evaluate consistently. This generalizes: whenever a quality target seems too subjective to measure, the right first step is to decompose it into the smallest specific judgments that sum to the holistic one.
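Concretely, the decomposition turns one unanswerable question into several answerable ones. The four top-level criteria below are the post's; the sub-questions under them are illustrative guesses:

```python
def call_model(role: str, prompt: str) -> str:
    raise NotImplementedError  # stand-in for a real LLM call

# "Is this design good?" decomposed into gradable sub-questions.
RUBRIC = {
    "design quality": ["Is the visual hierarchy clear at a glance?",
                       "Is spacing consistent across components?"],
    "originality":    ["Does the layout differ meaningfully from a default template?"],
    "craft":          ["Do hover and focus states behave consistently?"],
    "functionality":  ["Does every interactive element do what its label claims?"],
}

def grade(artifact: str) -> dict[str, float]:
    """Each sub-question is a small judgment a model can answer consistently;
    the holistic score is just their aggregate."""
    scores = {}
    for criterion, questions in RUBRIC.items():
        answers = [call_model("evaluator",
                              f"{q}\n\nArtifact:\n{artifact}\n\nAnswer yes or no:")
                   for q in questions]
        scores[criterion] = sum(a.strip().lower().startswith("yes")
                                for a in answers) / len(answers)
    return scores
```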
The GAN Pattern for Quality Systems.
Generative Adversarial Networks work by pitting a generator against a discriminator in a feedback loop. The post demonstrates that this pattern is portable beyond neural network training: any quality system with a producer and a critic can borrow the same structure, with explicit, structured feedback flowing from critic to producer and the critic tuned separately for skepticism. The resulting quality is higher than either party could achieve alone, and the improvement is repeatable across runs, because the loop structure forces explicit iteration rather than hopeful single-pass production.
Harness as Living Assumption Log.
A harness is not just an execution scaffold — it is a record of the current model's known limitations. Each component encodes one assumption: "the model cannot do X reliably without this scaffold." When a new model is released, the correct practice is to re-examine each component against its assumption, test whether the assumption still holds, and strip what doesn't. The harness becomes a living log: as assumptions go stale, components are removed; as new limitations are discovered, new components are added. This reframes maintenance from "keeping the old code working" to "keeping the assumptions honest."
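As a data structure the log is small: scaffolds keyed to the limitations they compensate for, pruned on every model release. The entries below are illustrative:

```python
# The harness as an assumption log: each scaffold is keyed to the model
# limitation it compensates for.
assumption_log = {
    "sprint construct":     "cannot sustain coherence across a long build",
    "context resets":       "truncates work as it nears its perceived limit",
    "standalone evaluator": "cannot critique its own output honestly",
}

def on_model_release(model: str, still_holds) -> None:
    """`still_holds(model, assumption) -> bool` runs the ablation experiment."""
    for component, assumption in list(assumption_log.items()):
        if not still_holds(model, assumption):
            print(f"retiring {component}: '{assumption}' no longer true of {model}")
            del assumption_log[component]
    # Newly discovered limitations get logged the same way, as new entries.
```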
When to reach for which.
Reach for the GAN pattern whenever output quality matters and one agent is currently both producing and judging. Reach for the context-anxiety diagnosis when a long-running agent starts wrapping up early with room to spare. Reach for the assumption log whenever a new model ships and the harness has not been re-examined since the last one.
The moving frontier.
The post's most durable contribution is a posture, not a technique. The posture is: treat every scaffold as a hypothesis about current model limitations, run experiments to test whether it still holds, and update the scaffold accordingly. This is the scientific method applied to AI engineering. It sounds obvious. It is rarely practiced.
The space of interesting harness combinations doesn't shrink as models improve. Instead, it moves — and the interesting work for AI engineers is to keep finding the next novel combination. — Prithvi Rajasekaran, Anthropic Engineering
The latticework this builds is about evolving systems. Not systems that were designed well once and then maintained. Systems where the design itself is understood as a living document of assumptions — assumptions that must be tested, updated, and eventually discarded as the frontier advances. The wombat's gut evolved under constraint. Rajasekaran's harness evolves under a different kind of constraint: the boundary of what the model can do alone. Both cases produce the same lesson: form follows constraint, and when the constraint changes, the form must change with it.