Running Anthropic computer use on a remote VM
Anthropic shipped computer use in October 2024, and US search interest climbed roughly 230% between December and March. The capability is straightforward to call: hand Claude a screenshot, get back keystrokes and mouse coordinates, apply them, screenshot again. The hard part isn't the API. It's the question nobody answers in the docs: where do you run the desktop the model is looking at?
Your laptop closes. Most cloud primitives don't ship with a display. The Anthropic reference container works on day one and hits a wall the first time you want to leave a session running, fork a state to try two paths, or check on a long-running task from your phone. This is a piece about building the thing that holds the desktop.
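The screenshot, act, screenshot loop described above comes down to translating each action the model returns into an input event on the desktop. Here is a minimal sketch of that translation step, assuming xdotool is installed on the host and that actions arrive as a name plus arguments. The action names (mouse_move, left_click, right_click, type, key) are modeled on Anthropic's computer-use tool and should be checked against the current API reference; apply_action is a hypothetical helper, not part of any SDK.

```shell
# Hypothetical helper: apply one model action to the X display named in $DISPLAY.
# Action names are assumptions modeled on Anthropic's computer-use tool.
apply_action() {
  case "$1" in
    mouse_move)  xdotool mousemove "$2" "$3" ;;
    left_click)  xdotool mousemove "$2" "$3" click 1 ;;   # button 1 = left
    right_click) xdotool mousemove "$2" "$3" click 3 ;;   # button 3 = right
    type)        xdotool type --delay 12 -- "$2" ;;       # 12 ms between keystrokes
    key)         xdotool key -- "$2" ;;                   # e.g. Return, ctrl+l
    *)           echo "unsupported action: $1" >&2; return 1 ;;
  esac
}
```

For example, `DISPLAY=:1 apply_action left_click 430 280` moves the pointer to (430, 280) and clicks. A real harness would wrap this in a loop that captures a screenshot after each action and sends it back to the model.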
What computer use actually needs from its host
The model doesn't care about your infrastructure. It cares about getting a screenshot back fast, sending keystrokes that take effect, and finding the same screen state on the next turn. That translates to five concrete requirements:
A real graphical environment. X11 or Wayland, with a window manager, a browser, a terminal, a file manager. The model is using the same surfaces a human would. There's no Computer Use without a desktop.
Low-latency screenshot capture. The screenshot is the model's vision. The round trip between "I think there's a button at (430, 280)" and "I see what happened when I clicked it" sets the pace of the agent's thinking. A slow display server adds its latency to every step.
Persistence between turns. The model's mental model assumes state continues. Open tabs stay open. Files stay where they were dropped. A login session from turn three is still valid on turn forty.
Survival across crashes. The agent makes mistakes. Sometimes those mistakes hang the browser, freeze a window, or leak memory until the session OOMs. If the host dies with the workload, you lose hours of progress. The host needs to come back.
A reachable address. You want to look at what the agent is doing. Sometimes mid-run. Sometimes from your phone. A VNC port behind a stable hostname is the difference between "I can see what's happening" and "I'm guessing from the screenshots in the log."
If your host doesn't give you those five, you'll write infra code instead of agent code.
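A quick way to see how far a candidate host is from those requirements is a preflight script. The sketch below checks two of them: whether the tools a graphical session depends on are present, and how long a screenshot takes. The function names and the tool list are illustrative assumptions, and the timer relies on GNU date's %N, so it is Linux-specific.

```shell
# Print each of the given commands that is not on PATH (prints nothing if all exist).
missing_tools() (
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
)

# Run a command and print how long it took, in milliseconds.
# Uses GNU date's %N (nanoseconds); BSD/macOS date does not support it.
elapsed_ms() (
  start=$(date +%s%N)
  "$@" >/dev/null 2>&1 || :   # report the timing even if the command fails
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
)
```

`missing_tools Xvfb x11vnc xdotool scrot` reports what still needs installing, and `DISPLAY=:1 elapsed_ms scrot -o /tmp/shot.png` gives a rough per-step capture cost, assuming scrot is the screenshot tool and a display is running on :1.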
Why the obvious hosts fall short
Lambda / Cloud Run / serverless. No display server. No persistence. Different problem.
Plain Docker containers. You can ship a desktop in a container (docker.io/dorowu/ubuntu-desktop-lxde-vnc exists for a reason), but containers were not shaped to be long-running graphical environments. Display drivers, audio, GPU passthrough, font rendering, clipboard: all the things you take for granted on a real desktop are container edge cases. The Anthropic reference image works precisely because it ships every workaround. Maintaining your own variant is more work than it sounds.
GitHub Codespaces. Has a remote desktop story via the VS Code GUI, but that's an editor session, not a Computer Use desktop. The model needs a generic desktop with arbitrary apps, not VS Code with extensions.
Your laptop with the Anthropic Docker image. Works for an afternoon. The laptop closes. The session dies. The agent gives up half-way through a task you'd hoped it would finish overnight.
A long-lived EC2 instance with X. Works. You're paying for a running desktop 24/7 even when no one's using it. You build the snapshot, networking, monitoring, and resume story yourself.
The pattern is familiar. Each option is shaped for a different workload; running Computer Use on top of it means writing the missing pieces.
What a computer use substrate looks like
Five properties, mapping to the five requirements above:
- Full Linux desktop — Ubuntu (or your distro), real window manager, real browser, real apps. Not a container minimised down to a single binary.
- A display server you can attach to — Xvfb for headless screenshots, Xpra or x11vnc when you want to watch. The model uses the first; you use the second.
- Persistent VM — runs across sessions. Browser tabs stay open. Logged-in services stay logged in. The agent's progress is the disk's progress.
- Forkable state — when the model takes a wrong turn, you want to roll back. When you want to try two approaches, you want to fork from the same starting state.
- Reachable on a stable address — VNC over TLS through a known hostname, available whether you're at your laptop or on your phone.
A persistent VM is the natural substrate for all five. It looks the most like a desktop because it is a desktop. The work is in making that VM cheap to keep around when the agent isn't running.
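The display-server property above can be sketched with two small wrappers: one starts the headless X server the model takes screenshots from, the other attaches a VNC server to the same display for a human observer. This is an illustration under assumptions, not boxd's or Anthropic's setup. The display number :1, the 1024x768x24 geometry, and port 5900 are arbitrary defaults, and the VNC server listens on localhost only (reach it over an SSH tunnel) rather than terminating TLS itself.

```shell
# Start a headless X server. Runs in the foreground; background it with '&'.
start_xvfb() {
  Xvfb "${1:-:1}" -screen 0 "${2:-1024x768x24}" -nolisten tcp
}

# Share an existing display over VNC, localhost only.
# -forever keeps serving after a client disconnects; -shared allows several viewers.
start_vnc() {
  x11vnc -display "${1:-:1}" -rfbport "${2:-5900}" -localhost -forever -shared
}
```

Typical use is `start_xvfb :1 & sleep 1; start_vnc :1 &`, after which screenshot and input tools run with DISPLAY=:1.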
What we built
boxd is a persistent VM platform shaped for long-lived agent workloads, and Computer Use is exactly that.
- Persistent Ubuntu VMs. Install a desktop once. It stays installed. Boot it once. It stays booted.
- Sub-millisecond resume from sleep. Idle VMs hibernate. Cost goes to near-zero. When the next agent turn arrives, the desktop is up before the screenshot tool finishes its handshake.
- 60ms fork of the whole machine. Branch the run. Try a different sequence of clicks from the same state. Keep the one that worked.
- Per-VM public IPv4, real DNS, real TLS. VNC into your agent's desktop from anywhere. Watch what it's doing live, take over if it gets stuck, hand control back.
- KVM isolation. A real kernel per VM. The agent installs a sketchy browser extension to test something — it's contained.
A working Claude computer use setup on boxd looks roughly like this:
ssh agent.example.boxd.sh
# Install Docker on Ubuntu 24.04.
curl -fsSL https://get.docker.com | sudo sh
sudo usermod -aG docker "$USER" && newgrp docker
# The upstream image bundles Xvfb, x11vnc, noVNC, and the Streamlit UI.
# Ports are bound to 127.0.0.1 so they are reachable only through the SSH
# tunnel below, not on the VM's public address.
export ANTHROPIC_API_KEY=sk-ant-...
docker run -d --name computer-use \
  -e ANTHROPIC_API_KEY \
  -v "$HOME/.anthropic:/home/computeruse/.anthropic" \
  -p 127.0.0.1:8080:8080 -p 127.0.0.1:8501:8501 \
  -p 127.0.0.1:6080:6080 -p 127.0.0.1:5900:5900 \
  ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest
Then, from your laptop, tunnel the combined UI:
ssh -L 8080:localhost:8080 agent.example.boxd.sh
Open http://localhost:8080. You get the agent chat and the live desktop in one pane — that's the right entry point. Direct VNC isn't needed; if you want to attach a separate VNC client, add -L 5900:localhost:5900 to the SSH command and connect to vnc://localhost:5900.
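Right after docker run, the UI on port 8080 may take a little while to start answering. A small retry helper avoids guessing when it is ready. wait_for is a hypothetical function written for this example, and the attempt count and delay are arbitrary.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_for <attempts> <delay-seconds> <command> [args...]
wait_for() (
  attempts=$1
  delay=$2
  shift 2
  i=1
  while :; do
    "$@" >/dev/null 2>&1 && exit 0     # success: stop retrying
    [ "$i" -ge "$attempts" ] && exit 1 # out of attempts: give up
    i=$((i + 1))
    sleep "$delay"
  done
)
```

With the tunnel open, `wait_for 30 2 curl -fsS http://localhost:8080/` returns as soon as the page responds and fails after roughly a minute if it never does.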
When you walk away, the VM hibernates. When you come back tomorrow, the agent is still where it was — same browser tab, same logged-in session, same progress.
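Resuming from hibernation covers the idle case; it does not cover a crash inside the container. One hedge is a small watchdog that restarts the container if it has stopped. The sketch below assumes the container name computer-use from the docker run command above, and ensure_running is a hypothetical helper. A restarted container keeps its filesystem but not in-memory state such as open browser tabs.

```shell
# Report whether a container is running, and start it again if it has stopped.
ensure_running() {
  name=${1:-computer-use}
  # docker inspect prints "true" or "false"; it fails if no such container exists.
  state=$(docker inspect -f '{{.State.Running}}' "$name" 2>/dev/null) || state=missing
  case "$state" in
    true)  echo "running" ;;
    false) docker start "$name" >/dev/null && echo "restarted" ;;
    *)     echo "missing" >&2; return 1 ;;
  esac
}
```

Run on a schedule (a cron entry or systemd timer calling `ensure_running computer-use`), this brings the desktop back without anyone logging in.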
To test what happens when Claude takes a different path, fork the VM: two parallel runs from the same state. Keep the better one.
When not to use a persistent VM
Honesty matters. Not every workload Claude could automate wants this primitive.
- Pure browser automation. If you're scraping a hundred URLs a minute and don't need a model in the loop, use Playwright or Puppeteer headless. They're cheaper, faster, and more reliable than running a model that drives a desktop.
- High-volume parallel agent runs. Thousands of independent short tasks. A pool of ephemeral browser sandboxes fits better than a fleet of full desktops.
- GPU-heavy workloads. If the agent is invoking a local model that needs a GPU, that's a different host. boxd does not solve that.
If the work is Claude operating a real computer over hours, with state that should persist, a persistent VM is the right primitive. If the work is one-shot scraping at scale, it isn't.
What to take from this
Computer use is the rare AI capability that exposes the underlying compute model. Most agent workloads can pretend to be stateless if the harness is clever; Computer Use cannot. The model is using a desktop. That desktop has to live somewhere.
A long-running computer use VM needs persistence, a display server, sub-second resume, forkable state, and a stable network address. Pick a host that gives you those for free, and you spend your engineering time on the prompts, the tools, and the workflow — not on rebuilding the substrate.
That's what boxd is for. SSH in. Install a desktop. Give Claude access to a computer. The computer stays. If you're shopping persistent-VM substrates for agent work, see boxd vs sprites.dev and boxd vs exe.dev — the two products in our actual lane.
Last verified: 2026-05-04. This article is informational, not legal advice. Send corrections to hello@boxd.sh.
Frequently asked
- What is Anthropic computer use?
  A capability where Claude takes screenshots of a desktop, decides where to click and what to type, and the host applies those actions and sends the next screenshot. It lets the model operate any GUI app a human could.
- Can I run Anthropic computer use on my laptop?
  Yes for development; no for anything you want to leave running. The Anthropic Docker image works locally, but the moment you close your laptop, the agent's state and the running task die. For multi-hour or overnight workloads you need a persistent host.
- Why not just use a Docker container?
  Containers can ship a desktop (the Anthropic reference image is one), but they were not shaped for long-running graphical environments. Display drivers, fonts, audio, clipboard, GPU passthrough: all the things you take for granted on a real desktop are container edge cases. A persistent VM running a normal desktop is closer to the model's mental picture of what it's interacting with.
- What's the cheapest way to run computer use 24/7?
  A persistent VM that sleeps when no agent is active. boxd's idle VMs hibernate; cost goes to near-zero. The VM wakes in under a millisecond when the next computer-use call arrives. Compared to running a 24/7 EC2 instance with a desktop, the bill is materially lower.
- Can I watch the agent work?
  Yes. Install x11vnc or Xpra alongside Xvfb on the VM and connect a VNC client to the per-VM hostname. You see what Claude sees, in real time. Useful for debugging, supervision, or taking over when the agent gets stuck.
- Should I use computer use or browser automation (Playwright / Puppeteer)?
  Computer use when the model needs to operate arbitrary GUI apps with no API. Playwright/Puppeteer when you're scraping web pages: they're cheaper, faster, and don't need a model in the loop.
Built by Azin Tech in Amsterdam. Open for early access.