What Is Harness Engineering for AI Agents

What Is Harness Engineering for AI Agents

Most people blame the model when an AI agent messes up. The agent wrote broken code, skipped a step, or confidently declared a task complete when it clearly wasn’t. So you switch to a newer model, or you rewrite the prompt, and for a while things seem better. Then the same kind of failure shows up again.

The thing is, the model probably isn’t the problem. The problem is what’s around the model — or more accurately, what isn’t. That’s what harness engineering is about. It’s a fairly new term, but the problem it solves has been annoying AI developers since agents started doing real work.

So basically, harness engineering is the discipline of building the environment that makes an AI agent actually reliable. Not smarter. Reliable. There’s a big difference, and once you understand it, the way you build and work with AI agents changes quite a bit.

Where This Term Came From

The term got its formal definition through an OpenAI post by Ryan Lopopolo, published on February 11, 2026, built on the experience of shipping a production application with zero human-written code. That’s a pretty wild claim, so let me put some numbers on it: his team spent five months building a production product with zero manually written lines of code. The codebase reached one million lines, managed across roughly 1,500 automated pull requests.

The humans weren’t writing code. They were designing the environment.

Then Mitchell Hashimoto gave the thing a name in early February 2026. His framing is why the term stuck: every time the agent makes a mistake, don’t just hope it does better next time. Engineer the environment so it can’t make that specific mistake the same way again.

Hashimoto is the co-founder of HashiCorp and the person who created Terraform, so when he publishes something about engineering practices, people pay attention. His core discipline: every time an AI agent makes a mistake, engineer a permanent fix into the agent’s environment so that mistake becomes structurally impossible to repeat. The formula he gave the field is simple: Agent = Model + Harness.

That formula is easy to dismiss as a slogan. But actually think about it. You could swap the model for a competitor and maybe see a 10–15% shift in output quality. Change the harness, and you change whether the whole system works at all. That’s a very different kind of leverage.

Martin Fowler followed with a formal taxonomy. At ThoughtWorks, Martin Fowler and Birgitta Böckeler published the definitive framework in April 2026: harness engineering is the discipline of designing the systems, constraints, and feedback loops that wrap around AI agents to make them reliable in production. Or more simply: everything in an AI agent except the model itself. The tools it can use, the permissions it’s granted, the state it persists, the tests that verify its work, the logs that make it observable, the guardrails that keep it safe, and the recovery mechanisms that handle failures.

What the Harness Actually Does

If prompt engineering optimizes a single question-answer exchange, and context engineering manages what information the model can see at any moment, harness engineering is the layer above both of those. Prompt engineering improves one turn. Harness engineering determines what the agent can and cannot do across all of them.

Think of it like this. The AI model is the engine — powerful, fast, capable of doing incredible things. But an engine without a vehicle around it doesn’t go anywhere useful. The harness is the vehicle: steering, brakes, fuel system, seatbelts, dashboard. Without it, you have a fast thing with no direction and no way to stop.

A harness is not the agent itself. It is the complete infrastructure that governs how the agent operates: the tools it can access, the guardrails that keep it safe, the feedback loops that help it self-correct, and the observability layer that lets humans monitor its behavior.

A production-ready harness has five layers. Understanding what each one does is the difference between an agent that kind of works in a demo and one that you can actually trust with real tasks.

Tool orchestration is the control plane. It decides which tools the agent can call, in what order, and what happens when a tool fails or times out. Without this layer, agents will call tools in the wrong sequence, retry the same failed call six times, or worse — call a destructive tool without any validation. The model returns a structured tool call; the harness validates the schema, checks permissions, executes, and injects the result back. This prevents prompt injection from escalating to arbitrary code execution.

Verification loops are automated QA steps that run during execution, not just at the end. This is where a lot of teams mess up. They let the agent run through an entire task and only check the final output. But agents frequently mark a task complete without verifying the outcome — this is called victory declaration bias, and it’s one of the most common ways agent work quietly degrades. Verification loops catch this mid-execution.

Context and memory manage what the agent remembers between steps and across sessions. Without proper memory management, as the context window fills up, models “panic” and rush to finish, cutting corners to avoid running out of space. Faros AI calls this “context anxiety,” and once you’ve seen it in a real deployment, you can’t unsee it. The agent starts producing sloppier and sloppier work as the context limit approaches, then wraps everything up in a rush.

Guardrails are the safety layer — budget limits, permission boundaries, human-in-the-loop triggers for risky actions. Dangerous actions are first drafted, then explicitly committed. This draft-commit pattern alone prevents a lot of painful rollbacks.

Observability is probably the most underbuilt layer in most teams’ setups. You need telemetry, execution traces, and audit logs so you can actually see what the agent did and why. Without this, debugging a failed agent run is basically impossible. You’re reading through logs trying to reconstruct decisions that happened 47 tool calls ago.

The Three Failure Modes That Harness Engineering Fixes

People have catalogued a lot of ways AI agents fail in production, but most of them trace back to three core patterns.

The first is victory declaration bias, which I mentioned above. The agent decides it’s done before it actually checks. It’s surprisingly common and genuinely annoying to debug because the output looks finished — until you actually run it.

The second is one-shotting overreach. Agents often try to tackle an entire problem in one go, which produces an undocumented tangle of changes. Instead of breaking a task into steps, the agent tries to handle everything at once. The result is a pile of changes that are hard to review and harder to roll back. Good harness engineering forces decomposition — the agent works in smaller, verifiable chunks.

The third is self-evaluation bias. When agents assess their own output, they score it high. Even on tasks with objective pass/fail criteria, the agent spots a problem, talks itself into believing it’s not serious, and approves work that should fail. This is the sneakiest one. The agent isn’t lying — it genuinely “thinks” the work is good. The fix, interestingly, borrows from generative adversarial networks: separate the generator from the evaluator completely. In a GAN, two neural networks compete — one generates, one judges — and that adversarial tension forces quality up. In a harness, this means running a separate evaluator agent that has no shared context with the generator. The Evaluator must be architecturally separate from the Generator — shared context reintroduces the same bias you’re trying to eliminate.

I spent way too long trying to figure out why one of my own agents kept approving its broken outputs before I realized the evaluator was seeing the same conversation history as the generator. Of course it was agreeing with itself. Once I separated the two into different context windows, the accuracy on catching its own errors went up noticeably.

The AGENTS.md File — Small Thing, Big Impact

One of the most practical tools in harness engineering is embarrassingly simple: a text file.

An AGENTS.md file is the foundational component of a harness. It sits at the root of a repository and tells the agent how to behave in that codebase: project structure, build and test commands, coding conventions, and a list of anti-patterns the team has identified from previous agent sessions. It grows incrementally — one rule added each time an agent repeats a mistake.

This is the direct implementation of Hashimoto’s original principle. Every time something goes wrong, you don’t just fix it in that session. You add a rule to the file so the next session starts with that knowledge already baked in. Over time, the AGENTS.md file becomes a record of every failure the agent has made and the constraint that prevents it from making the same mistake again.

It sounds almost too simple, and I think that’s why people skip it at first. You’re dealing with all this impressive technology and the fix is… a markdown file with rules in it? But it works. You start small, then add rules when the agent repeatedly fails in the same place. This is the same pattern Hashimoto described: every time the agent makes a mistake, add the instruction that prevents that mistake from repeating.

The mindset shift this creates is actually more valuable than the file itself. You stop thinking “this model is bad at this thing” and start thinking “my system allowed this failure.” That reframing moves the locus of control back to you, which means you can actually do something about it.

What LangChain Proved in March 2026

The clearest real-world proof that harness investment beats model investment came from the LangChain engineering team earlier this year. In March 2026, the LangChain engineering team moved their coding agent from the 30th to the 5th place on Terminal Bench 2.0 without changing the underlying model at all; the improvement was achieved entirely by optimizing the harness.

30th to 5th. Same model. Better harness.

That’s not a small jump. And it happened without paying for a newer or bigger model, without retraining anything, without waiting for a release. Just harness work.

Research from Faros’s AI Engineering Report 2026 found that AI adoption is producing code changes that are larger, more complex, and carry a wider blast radius than before. More code, faster, doesn’t equal better outcomes. Teams that figure out harness engineering are the ones actually getting reliable output from their agents.

Building vs Buying — Where to Put Your Energy

Organizations should buy the commodity plumbing — such as managed runtimes, basic telemetry, and control planes from hyperscalers or open-source frameworks — and build proprietary integrations like domain-specific tools, custom evaluation datasets, and environment maps.

This is solid advice. The generic parts of a harness — the basic logging, the session management, the standard tool interfaces — you don’t need to build those from scratch. Frameworks like LangChain, OpenAI’s Codex harness, and Claude Code all provide these out of the box. What you need to build is the domain-specific stuff: the verification logic that understands your data, the guardrails that match your risk tolerance, the tools that connect to your actual systems.

A robust harness functions as an intertwined three-layer system designed to govern agentic workflows. The Information Layer dictates exactly what data the agent can observe and what active tools it has the authority to invoke at any precise second. That information layer is where most of the custom work goes, and it’s also where most of the value is. Generic tools give the agent generic capabilities. Your custom tool layer gives it the specific capabilities your work actually requires.

One thing that trips teams up: they build the harness for the current model. But harness engineering isn’t a fixed architecture. It’s a system recalibrated with every new model release. The first question after any major upgrade isn’t “what can I add?” It’s “what can I remove?” Some constraints become unnecessary overhead as models improve. A harness component that was critical because the model couldn’t handle long tasks might become dead weight after a context window upgrade. Review the harness when you upgrade the model.

Measuring Whether It’s Actually Working

The productivity gains from harness engineering are real, but they’re easy to miss if you’re measuring the wrong things. AI coding tools have transformed the day-to-day work of software developers faster than the industry’s measurement frameworks can keep up.

Key metrics to track include cost per merged PR, time-to-merge for agent-assisted PRs, review velocity relative to PR size, compute spend per developer, code churn on agent-touched code, first-pass success rate, and defect escape rate.

That last one — defect escape rate on agent-touched code — is probably the most telling metric, and also the hardest to get without good observability. If agent-written code is producing more bugs in production than human-written code, you have a harness problem, not a model problem.

Linking agent sessions to PRs enables teams to connect each agent session to the PR it produced, label sessions by engineer intent, and trace bugs or incidents to specific agent-assisted PRs. This linkage is the foundation everything else rests on. Without it, you can’t tell whether a production bug came from an agent session and you can’t trace it back to a harness failure to fix.

Also worth tracking: reviewer confidence. If engineers are spending more time reviewing agent PRs than they did reviewing code they wrote themselves, the harness isn’t providing enough context. Good harness design means the agent shows its work — the reasoning behind its changes, the tests it ran, the things it checked. A reviewer should be able to audit an agent PR faster than a human one, not slower.

What to Actually Do This Week

If you’re using any AI coding agent right now — Claude Code, Cursor, Codex, Gemini CLI, doesn’t matter which — and you don’t have an AGENTS.md or CLAUDE.md file in your repository, start there. Create one. Put in the project structure, the build commands, the conventions your team follows.

Then the next time the agent makes a mistake, don’t just fix the mistake. Add a rule to the file. One rule. Whatever would have prevented that specific failure. Repeat every time something goes wrong.

This is harness engineering at its most basic, and it’s also where the compounding starts. The first month, you add maybe five or six rules. Three months later, you have a file that encodes dozens of hard-won lessons, and your agent sessions start significantly cleaner than they did at the beginning. The agent isn’t smarter. Your system is.

After that, look at your tool layer. What does the agent need to verify its own work? If you’re doing code generation, does the agent have access to a test runner? A linter? A static analysis tool? If not, you’re relying on the model to catch its own errors, which is exactly the self-evaluation bias problem.

And once you have the basics in place, add observability. At minimum, log every tool call the agent makes and the result it got. This sounds like extra work until the first time an agent run fails mysteriously at step 23 and you have to figure out why.

The teams getting consistent, reliable output from AI agents right now aren’t using better models. They’re building better systems around the models they already have.

That’s harness engineering. And it turns out, it was the missing piece the whole time.

Post a Comment

Previous Post Next Post