Where to run an agent harness in production
A coding agent runs for hours. It reads files, writes code, runs tests, reads logs, fixes the bug, runs them again. The loop is the work. Anthropic calls the structure that holds that loop a harness — the orchestration layer between the model and the world it's acting on. LangChain calls it the same thing. Some teams say agent runtime. The vocabulary is settling fast.
There's already good writing on what a harness is. Anthropic's piece on effective harnesses for long-running agents covers the structural choices. LangChain wrote up the anatomy. Martin Fowler published the harness-engineering essay. Read those.
This piece is about the question those don't answer: where does the harness live in production?
What "production" means for a harness
The harness has properties most cloud primitives weren't shaped for.
It's stateful. Session history, scratchpad, intermediate files, checkpoints, partial work — the agent's working memory lives across the entire loop. Lose it and the next iteration starts from a worse position than it should.
It runs long. Minutes, hours, sometimes days. A coding agent landing a multi-step refactor is not making one API call; it's making thousands, with file system state evolving between them. Same shape for any LLM harness running real tool-use loops.
It needs to fork. Try this branch is the natural shape of agent work. A planner spawns three explorers; each picks up the same state and runs forward differently. Cloning a process from disk does not capture the in-memory state that matters.
It needs an address. When a tool call returns asynchronously — webhook, scheduled job, human-in-the-loop — the response has to land somewhere stable. A new container with a new IP every run does not give you that.
It dies. Models fail mid-call. Tools time out. The host you're running on reboots. The harness needs to come back exactly where it was, not from the last manual snapshot from twenty minutes ago.
If your substrate doesn't account for those five traits, you're going to spend a lot of time writing harness code that should have been infrastructure.
Why the obvious answers fall short
Sandboxes (E2B, code interpreters, Daytona's old product). Built for ephemeral one-shot execution. Great for run this snippet, return the result. Wrong shape for run this loop for six hours with mutable state across iterations. E2B has shipped pause/resume; checkpoint/fork is what's still on their roadmap. The whole category is racing toward persistence; until that fully lands, you're caught between rebuilding state every session and operating the persistence layer yourself.
Codespaces. Stops after thirty minutes of inactivity. Bills per running minute. Session state lives on a Codespace volume, but the long-running compute model isn't there. You can keep one alive 24/7, but the bill adds up fast and the hibernation primitives are not shaped for sub-second wake.
Containers on ECS / Cloud Run / Fly Machines. Stateless services with mounted volumes. Volumes persist; the process and its memory don't. Restarting the container means rehydrating from disk, which means writing code to checkpoint everything you care about. Possible. Tedious. Easy to get subtly wrong, especially under partial failure.
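To make the tedium concrete: here is roughly the glue a harness on a stateless container ends up carrying. This is a hedged sketch, not any platform's real pipeline — the paths, the volume assumed mounted at /mnt/volume, and the checkpoint-on-signal policy are all illustrative assumptions.

```shell
# Illustrative paths; assumes a persistent volume mounted at /mnt/volume.
STATE_DIR="${STATE_DIR:-/workspace/state}"
SNAP="${SNAP:-/mnt/volume/checkpoint.tar.gz}"

checkpoint() {
  # Snapshot the working directory atomically: write a temp archive,
  # then rename. Anything held only in process memory (open REPLs,
  # in-flight tool calls) is lost regardless.
  tar -czf "${SNAP}.tmp" -C "$STATE_DIR" . && mv "${SNAP}.tmp" "$SNAP"
}

restore() {
  # On container start: rehydrate from the last snapshot, if any.
  mkdir -p "$STATE_DIR"
  if [ -f "$SNAP" ]; then
    tar -xzf "$SNAP" -C "$STATE_DIR"
  fi
}

# Checkpoint on shutdown signals. The partial-failure window remains:
# a SIGKILL between checkpoints loses everything since the last one.
trap checkpoint TERM INT
```

And this is the easy half — files on disk. Capturing running processes, open sockets, and half-finished tool calls is where it gets subtly wrong.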
Lambda / Modal / serverless functions. Optimised for short, parallel, stateless work. Different primitive entirely. Modal's serverless GPU is the right answer for run inference at scale. It is the wrong answer for give my agent a workspace it owns.
A laptop or a long-lived EC2. Works for development. The trade-offs show up in production: no millisecond fork, no idle hibernation, and a 24/7 bill even when the agent is between sessions — which is most of the time.
The pattern: each of these is shaped for a different problem. Run a harness on any of them and you spend your engineering budget bridging the gap.
What an agent harness substrate actually needs
Five properties, mapping to the five harness traits above.
- Persistent by default. State survives between sessions without writing checkpoint code. The disk is the disk. The processes that were running stay running.
- Cheap when idle. A persistent machine that bills you for compute it isn't doing is just an expensive container. The substrate should sleep the VM when no one's working and bill near zero for that time.
- Sub-second resume. Sleep is only useful if waking up is fast. If resuming the agent's environment takes thirty seconds, you'll feel it on every iteration. Sub-second feels like the machine never slept.
- Forkable from running state. Not from a container image. From the actual machine — memory, disk, in-flight processes. Sub-100ms is the difference between forking three explorers being cheap and being something you avoid.
- A stable network identity. Per-VM public IP, real DNS, real TLS. The harness is a long-running thing in the world; tools come back to it; humans connect to it; webhooks land on it.
There's no fundamental reason a single substrate can't have all five. They just don't all show up together in most cloud primitives, because most cloud primitives were shaped for a different decade's workloads.
What we built
boxd is a substrate for harnesses. The five properties above are the design, and they fall out of one architectural bet: the agent lives inside the machine.
- Persistent VMs. Every machine survives between sessions. State, env, processes — all preserved. No checkpoint code; the disk and the running processes are the checkpoint.
- Costs nothing while it sleeps. Idle VMs hibernate. You pay for what you run, not for what you keep around — see pricing.
- Sub-millisecond resume. Wake from sleep in under a millisecond. The harness's next iteration starts as if the machine never paused.
- 60ms fork of the whole machine. Copy-on-write fork captures memory, disk, and in-flight processes. Spawn three explorers from the same state, run them forward independently.
- Per-VM public IPv4. Every VM gets a routable address from our own pool. SSH in. It's yours.
That last bit matters more than it sounds. SSH is the API. Your harness — Claude Code, OpenCode, an LLM-driven shell loop, or something you wrote yourself — connects in the same way you would. No SDK. No auth-token dance. Your SSH public key is your identity.
A working Claude Code harness on boxd looks like this:
```shell
ssh agent.example.boxd.sh "claude --continue"
```
The agent picks up where it left off. Files where you left them. Processes where you left them. If the harness forks a worker, that worker is on the same network with its own IP. If it calls a webhook, the webhook's response comes back to a stable address.
When the model hangs or a tool call times out and the harness loop crashes — the machine survives. SSH back in. Resume. The architecture absorbs the failure mode instead of leaking it into the harness code.
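The SSH-back-in-and-resume loop can itself be a few lines of shell. A minimal sketch, with the host name and resume command illustrative and `RETRY_DELAY` a made-up knob — the point is that the wrapper retries, because the machine keeps the state:

```shell
# supervise: re-run a harness command until it exits cleanly.
# Because the machine and its state persist across crashes, each
# retry resumes the session rather than restarting it.
supervise() {
  until sh -c "$1"; do
    echo "harness exited with status $?; state intact, retrying" >&2
    sleep "${RETRY_DELAY:-5}"
  done
}

# In production, something like (host name illustrative):
#   supervise "ssh agent.example.boxd.sh 'claude --continue'"
```

The wrapper stays this small only because the substrate does the hard part; on a stateless host the same loop would have to rebuild the workspace before every retry.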
When not to use a persistent VM
Honesty matters. Not every agent workload wants this primitive.
- Short, stateless code-eval. Run this Python snippet, return stdout. Use E2B or a code-interpreter sandbox. Persistence is overhead.
- High-volume one-shot inference. Thousands of independent calls a second, no shared state. Use Modal or Lambda. The substrate cost of a per-job VM does not pay for itself.
- Pure GPU training jobs. Ephemeral training pipelines on dedicated GPU clusters are a different problem. We do not solve that.
If your agent's loop is small, fast, and forgettable — go ephemeral. If your agent's loop is long, stateful, or branching — go persistent.
Most coding agents in 2026 are the second kind.
What to take from this
Anthropic and LangChain define the harness. That work matters; building a good harness on top of a hostile substrate is most of why agent infrastructure feels brittle today. But the harness is only half the system. The other half is what holds it.
A long-running agent harness needs a substrate that is stateful, cheap when idle, fast to resume, fast to fork, and reachable on a stable address. boxd is shaped that way because that's the workload we built for.
The model picks the action. The harness runs the loop. The substrate holds the state. Pick a substrate that was shaped for the work.
If you're shopping in this category, the two products in our actual lane are exe.dev and Sprites — see boxd vs exe.dev and boxd vs sprites.dev for honest head-to-head comparisons. If you're hosting MCP servers or running Claude computer use, the same substrate properties apply.
Last verified: 2026-05-04. This article is informational, not legal advice. Send corrections to hello@boxd.sh.
Frequently asked
- What is an agent harness?
- The orchestration layer between the model and the world it's acting on — the code that gives the model tools, applies its outputs, captures results, and drives the loop. Anthropic, LangChain, and Martin Fowler all use the term. Some teams say agent runtime to mean the same thing.
- What's the difference between an agent harness and an agent runtime?
- Used interchangeably in practice. When teams want to distinguish: harness = the orchestration shell (the loop, tool calls, prompts), runtime = the compute substrate the harness runs on (the VM, the container, the host). This article is about the runtime layer.
- Where should I run an agent harness in production?
- On a substrate with five properties: persistent state across sessions, sub-second resume from sleep, fork from running state, a stable network identity, and crash survival. Persistent microVMs (boxd, sprites.dev) fit all five. Ephemeral sandboxes, containers, and serverless functions only fit a subset.
- Can I self-host an agent harness substrate?
- Yes — boxd ships as one Rust binary that runs on any host with KVM. Drop it in your VPC or your country and run agents on infrastructure you control. See self-hosted boxd for the operational guide.
- How is this different from running my agent on a VPS?
- A long-lived VPS works, but it doesn't sleep when idle (you pay 24/7), doesn't fork from running state in milliseconds, and doesn't ship with the per-VM public IP / DNS / TLS plumbing that a harness needs to be reachable. The persistent-microVM substrate is shaped for the workload; the long-lived VPS is shaped for a web service.
Read next
boxd vs sprites.dev: two bets on persistent agent compute
Sprites and boxd agree on the architecture: persistent microVMs are the right primitive for AI agents. Where the two implementations diverge, honestly.
Persistence beats ephemeral
Why persistent state is the foundation of everything boxd does — and what falls out for free.
The agent lives inside the machine
Why every boxd decision — fork, hibernation, persistence, SSH — flows from one architectural bet.
Try it now
No signup. No install. Just SSH.
Built by Azin Tech in Amsterdam. Open for early access.