
Integrating Computer-Use (OSWorld) into Harbor

Full Project repo link (Harbor fork).

Harbor is a framework (created by the Terminal-Bench team) for evaluating and optimizing LLMs and AI agents at scale across a variety of benchmarks. It supports creating benchmarks and environments and generating rollouts for reinforcement learning optimization. Harbor can run any agent (including Claude Code, OpenHands, Aider, Codex CLI, and others) against task suites such as SWE-Bench, Terminal-Bench, and Aider Polyglot, managing thousands of parallel sandbox environments.

At its core, Harbor’s abstraction is simple: you define a task (an instruction, an environment, and a verifier), point an agent at it, and collect the resulting trajectories and rewards, at any scale, concurrently. Harbor takes care of the environment lifecycle, parallel task execution, trajectory recording (in Harbor’s Agent Trajectory Interchange Format, or ATIF), and metric aggregation.

Until now, Harbor has supported only text-based benchmarks: coding tasks where agents interact via the terminal or file-system APIs, typically in headless Linux environments using Docker images or cloud sandboxes. This approach works well for command-line and programming evaluations, but doesn’t address a crucial class of capabilities: computer-use agents, models that interact with full desktop GUIs, using screenshots, mouse movements, and keyboard input just as humans do.

This post describes the integration of OSWorld into Harbor, enabling evaluation of computer use agents on real Ubuntu and Windows desktops running in Daytona cloud sandboxes and QEMU/KVM VMs on bare-metal servers.

Computer-use and knowledge-work benchmarks are becoming more relevant now

Computer use is the current frontier for foundation models. The latest releases from Anthropic (Opus 4.6) and OpenAI (GPT-5.4) treat computer use as a first-class capability, featuring it at the top of their model cards. OSWorld numbers (as of March 6th, 2026):

GPT-5.4, Sonnet 4.6, and Opus 4.6 crossed the human baseline (72.4%) on OSWorld-verified. These aren't marginal improvements. The gap between GPT-5.2 and GPT-5.4 on OSWorld is 28 points in a single generation.

Beyond screenshot-and-click models, Standard Intelligence's FDM-1 (Feb 2026) takes a different approach: trained on 11 million hours of screen recordings, it processes nearly 2 hours of continuous video in a single session at 30 FPS. Their video encoder compresses this into ~1M tokens, 50–100x more efficient than VLM-based approaches. FDM-1 handles continuous motion (scrolling, dragging, 3D manipulation) that screenshot-based agents fundamentally can't, and was fine-tuned for autonomous vehicle control with under an hour of driving footage.

Many tasks can be solved with a terminal-only approach (no GUI/desktop needed), as in Terminal-Bench, but a capable model shouldn't be limited to terminal-only interaction. It should be able to execute any task regardless of the input-output modality.

OSWorld Task Breakdown

OSWorld contains 418 tasks across real desktop environments. Each task gives an agent a natural language instruction and a live desktop. The agent must complete the task using only screenshots and mouse/keyboard input.

Ubuntu: 369 tasks across 10 categories

Windows: 49 tasks across 4 categories

Ubuntu tasks cover browser automation, image editing, office productivity, system administration, email, media playback, and code editing. Windows tasks focus on Office applications and cross-application workflows. Every task includes a programmatic evaluator: not LLM-as-judge, but deterministic comparison of file contents, UI state, or application output.

Technical Implementation

System Architecture

The Adapter: OSWorld → Harbor Tasks

The adapter (adapters/osworld/adapter.py) reads OSWorld's task index (test_all.json for Ubuntu, test_windows.json for Windows) and generates one Harbor task directory per task:

win_excel__3aaa4e37-dc91-482e-99af-132a612d40f3/
├── task.toml            # timeouts, resource limits, os_type
├── instruction.md       # natural language task for the agent
├── environment/
│   └── Dockerfile       # unused for QEMU/Daytona, kept for compatibility
└── tests/
    ├── test.py          # runs eval_runner, writes reward.txt
    └── task_config.json # OSWorld evaluator config (metric, getter, expected)
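For a sense of what the task metadata looks like, a task.toml might read roughly as follows (the field names here are illustrative assumptions, not Harbor's exact schema):

```toml
# Hypothetical sketch -- key names are assumptions, not Harbor's actual schema
[task]
os_type = "windows"

[limits]
agent_timeout_sec = 1800   # wall-clock budget for the agent loop
memory_mb = 4096           # VM / sandbox memory
```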

task_config.json carries the original OSWorld evaluation specification: which metric function to use (compare_csv, compare_table, exact_match, etc.), how to extract the current state from the desktop (getter), and the expected result. The test script invokes the eval runner inside the VM, which resolves the metric, runs the getter, compares, and writes a binary score (0 or 1) to /logs/verifier/reward.txt.
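The dispatch step can be sketched in a few lines. This is a hypothetical simplification: the real metric and getter implementations live in OSWorld's desktop_env package, and the metric table here contains only a toy `exact_match`.

```python
# Hypothetical sketch of the eval-runner flow: resolve metric, compare the
# extracted state against the expected result, write a binary reward.
import json

METRICS = {
    "exact_match": lambda actual, expected: 1 if actual == expected else 0,
}

def run_eval(config_path: str, current_state: str, reward_path: str) -> int:
    with open(config_path) as f:
        config = json.load(f)
    metric = METRICS[config["metric"]]     # e.g. "exact_match"
    score = metric(current_state, config["expected"])
    with open(reward_path, "w") as f:      # /logs/verifier/reward.txt in Harbor
        f.write(str(score))
    return score
```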

The Agents

I built two simple agents to test the end-to-end integration: anthropic-cua and openai-cua. Both implement the same loop: take a screenshot, send it to the model, execute the returned actions. They differ only in API mechanics.

anthropic-cua uses Claude's Computer Use beta. The agent auto-detects the model tier and adjusts parameters:

Opus 4.6 introduced adaptive thinking (the model decides when to think, controlled by effort instead of a fixed budget_tokens), the zoom action for inspecting screen regions at full resolution, and the computer_20251124 tool version. The system prompt includes self-verification guidance per Anthropic's docs: "after each step, take a screenshot and evaluate the outcome."

while not done:
    screenshot = await desktop.take_screenshot()
    response = client.beta.messages.create(
        model=model,
        messages=[..., screenshot],
        tools=[computer_use_tool],
        betas=[beta_flag],
        thinking={"type": "adaptive"},     # Opus 4.6 only
        output_config={"effort": "high"},  # Opus 4.6 only
    )
    for action in response.content:
        match action.type:
            case "click":      await desktop.mouse_click(x, y, button)
            case "type":       await desktop.keyboard_type(text)
            case "key":        await desktop.keyboard_press(key)
            case "scroll":     await desktop.mouse_scroll(x, y, delta)
            case "screenshot": pass  # next iteration takes fresh screenshot
            case "wait":       await asyncio.sleep(seconds)

openai-cua uses GPT-5.4's Responses API with the GA computer tool. Key differences:

  • client.responses.create() with previous_response_id for conversation continuity, removing the need for manual message history management.
  • GPT-5.4 returns batched actions (multiple actions per turn). The harness executes all actions in order, then takes one screenshot for the next turn.
  • reasoning: {"effort": "high"} on every turn for consistent deep reasoning.
  • Handles pending_safety_checks by acknowledging them in the next computer_call_output.
  • System prompt grants explicit "pre-approval" for password entry to prevent safety-check pauses.
response = client.responses.create(
    model="gpt-5.4", input=[user_msg, screenshot], tools=[{"type": "computer"}],
    reasoning={"effort": "high"}, truncation="auto",
)
while has_computer_call(response):
    actions = response.output[call].actions  # batched actions
    for action in actions:
        match action["type"]:
            case "click":    await desktop.mouse_click(action["x"], action["y"])
            case "type":     await desktop.keyboard_type(action["text"])
            case "keypress": await desktop.keyboard_press(action["keys"])
            case "scroll":   await desktop.mouse_scroll(action["x"], action["y"], ...)
    screenshot = await desktop.take_screenshot()
    response = client.responses.create(
        previous_response_id=response.id,
        input=[{"type": "computer_call_output", "call_id": call_id, "output": screenshot}],
    )

Both agents compress screenshots to JPEG (quality 60) before sending, to keep the context from growing without bound. The actual image format is detected from magic bytes, so each API receives the correct media type.
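The magic-byte check is simple. A minimal version (the repo's actual helper may be named and structured differently):

```python
# Detect a screenshot's media type from its magic bytes, so each API gets
# the correct content type even after JPEG re-encoding.
def detect_media_type(data: bytes) -> str:
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    return "application/octet-stream"
```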

DesktopInterface: The Agent-Environment Contract

Both agents are decoupled from the environment through DesktopInterface. This is the only API an agent needs to implement a CUA loop.
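A minimal sketch of that contract, using the method names that appear in the agent loops above (the real interface in the repo may carry extra parameters and methods, e.g. for recording):

```python
# Sketch of the DesktopInterface contract. Method names mirror the calls in
# the agent loops shown earlier; exact signatures in the repo may differ.
from abc import ABC, abstractmethod

class DesktopInterface(ABC):
    @abstractmethod
    async def take_screenshot(self) -> bytes: ...

    @abstractmethod
    async def mouse_click(self, x: int, y: int, button: str = "left") -> None: ...

    @abstractmethod
    async def keyboard_type(self, text: str) -> None: ...

    @abstractmethod
    async def keyboard_press(self, key: str) -> None: ...

    @abstractmethod
    async def mouse_scroll(self, x: int, y: int, delta: int) -> None: ...
```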

To add a new CUA agent (e.g. UI-TARS, FDM-1): implement BaseAgent, register in AgentName enum and factory.py, and call DesktopInterface methods in your loop. The agent never needs to know whether it's running on QEMU or Daytona.

Daytona: Cloud Sandbox Execution

For Daytona, the environment creates an ephemeral sandbox per trial from a base snapshot (ubuntu-large for Ubuntu, windows-base for Windows). The DesktopInterface wraps Daytona's computer_use SDK:

screenshot_bytes = await sandbox.computer_use.screenshot.capture()
await sandbox.computer_use.mouse.click(x, y, button="left")
await sandbox.computer_use.keyboard.type(text)
await sandbox.computer_use.recording.start()

This is the same DesktopInterface API the agent calls. The agent doesn't know or care whether it's talking to a local QEMU VM or a Daytona sandbox in the cloud.

For Ubuntu, a shell setup script installs applications and packages at sandbox creation (~2-5 min). For Windows, a Python setup script installs evaluation packages (openpyxl, pandas, lxml, etc.) and ffmpeg, and Harbor deploys the OSWorld desktop_env evaluators with safe import wrappers that gracefully skip heavy dependencies like easyocr or librosa that aren't needed for most task categories.
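The "safe import wrapper" idea can be sketched like this (the helper name is hypothetical; the wrappers actually deployed may be structured differently):

```python
# Optional heavy dependencies (easyocr, librosa, ...) resolve to None instead
# of crashing evaluator import on sandboxes that don't need them.
import importlib

def optional_import(name: str):
    try:
        return importlib.import_module(name)
    except ImportError:
        return None

easyocr = optional_import("easyocr")   # None if not installed
# Evaluators then guard usage: if easyocr is None, skip OCR-based checks.
```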

QEMU/KVM: Bare-Metal Execution

Each trial boots a QEMU VM from a copy-on-write overlay on a pre-baked base image:

qemu-img create -f qcow2 -b ubuntu.qcow2 -F qcow2 overlay_trial_xyz.qcow2
qemu-system-x86_64 -enable-kvm -cpu host -m 4G -smp 1 \
    -drive file=overlay_trial_xyz.qcow2 -nographic \
    -net user,hostfwd=tcp::15000-:5000

The VM runs a Flask HTTP server on port 5000 with two endpoints: /screenshot (returns PNG) and /execute (runs a command, returns JSON with stdout/stderr/returncode). On Ubuntu, xdotool handles mouse and keyboard input. On Windows, pyautogui does the same through the execute endpoint. Screen recording uses ffmpeg with x11grab (Ubuntu) or gdigrab (Windows).
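The core of the /execute endpoint is just a subprocess wrapper. A stand-in sketch of that handler logic, outside any web framework (the in-VM server's actual code may differ):

```python
# Core logic of the hypothetical /execute handler: run a shell command and
# return stdout/stderr/returncode as a JSON-serializable dict.
import subprocess

def execute_handler(payload: dict) -> dict:
    proc = subprocess.run(
        payload["command"],
        shell=True,
        capture_output=True,
        text=True,
        timeout=payload.get("timeout", 60),
    )
    return {
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "returncode": proc.returncode,
    }
```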

The bake step (scripts/osworld/bake-qcow2.sh) is critical: it boots the base VM once, installs all evaluator dependencies (the desktop_env package from OSWorld, Playwright with Chromium, xdotool, Python packages), configures applications (Chrome remote debugging, VLC HTTP interface, LibreOffice save formats), and commits changes back to the qcow2. All subsequent COW overlays inherit these dependencies without re-installing.

Concurrent trials each get their own overlay and unique port allocation, so 20–50 VMs can run simultaneously on a single bare-metal server without interfering with each other.

Trajectory Recording and Viewer

Every trial produces:

  • ATIF trajectory (trajectory.json): structured record of every agent step: screenshot, model response, action, timing, token usage
  • Screen recording (recording.mp4): continuous video of the desktop during the trial
  • Agent logs (trial.log): environment setup, task setup steps, errors, verifier output
  • Verifier output (verifier/output.txt, verifier/reward.txt): evaluator logs and the final binary score

Harbor's built-in viewer (harbor view) serves a web UI that renders trajectories with step-by-step screenshots, action overlays, token usage breakdowns, recording playback, and side-by-side comparison across trials. This is essential for debugging agent failures: you can see exactly which screenshot the model misinterpreted, which click missed its target, or where the agent entered an unrecoverable loop.

Running It

Replace anthropic-cua with openai-cua in any example to use GPT-5.4 instead. Add --model anthropic/claude-opus-4-6 to use Opus 4.6.

Single Ubuntu task on Daytona

# Claude (default: Sonnet 4.5)
harbor run --config examples/configs/osworld-daytona-job.yaml \
    --path ~/.harbor/data/osworld/tasks \
    -t os__94d95f96-9699-4208-98ba-3c3119edf9c2

# GPT-5.4
harbor run --config examples/configs/osworld-daytona-job.yaml \
    --path ~/.harbor/data/osworld/tasks \
    -t os__94d95f96-9699-4208-98ba-3c3119edf9c2 \
    --agent openai-cua

# Claude Opus 4.6
harbor run --config examples/configs/osworld-daytona-job.yaml \
    --path ~/.harbor/data/osworld/tasks \
    -t os__94d95f96-9699-4208-98ba-3c3119edf9c2 \
    --agent anthropic-cua --model anthropic/claude-opus-4-6

Single Windows task on Daytona

harbor run --config examples/configs/osworld-windows-daytona-job.yaml \
    --path ~/.harbor/data/osworld/tasks_windows \
    -t win_excel__3aaa4e37-dc91-482e-99af-132a612d40f3 \
    --agent openai-cua

Full benchmark on Daytona (369 Ubuntu + 49 Windows)

# Ubuntu - 10 concurrent sandboxes
harbor run --config examples/configs/osworld-daytona-job.yaml \
    --path ~/.harbor/data/osworld/tasks \
    --n-concurrent 10 --agent anthropic-cua

# Windows - 4 concurrent sandboxes
harbor run --config examples/configs/osworld-windows-daytona-job.yaml \
    --path ~/.harbor/data/osworld/tasks_windows \
    --n-concurrent 4 --agent openai-cua

Full benchmark on QEMU (bare-metal)

# Ubuntu - 20 concurrent VMs
harbor run --path ~/.harbor/data/osworld/tasks \
    --n-concurrent 20 --agent anthropic-cua --env qemu

# Windows - 10 concurrent VMs
harbor run --path ~/.harbor/data/osworld/tasks_windows \
    --n-concurrent 10 --agent openai-cua --env qemu

View results

harbor view --host 0.0.0.0 -p 8080 jobs/

Future Work

Speed: The current bottleneck is environment setup time, not agent execution. On Daytona Windows, pip-installing evaluation packages takes ~4 minutes per sandbox. Pre-baked snapshots would eliminate this entirely. On QEMU, Windows VMs take ~60 seconds to boot vs ~20 seconds for Ubuntu. UEFI firmware initialization is the culprit.

Other CUA agents: The DesktopInterface is agent-agnostic. Two agents are integrated: anthropic-cua (Claude Computer Use, supporting Sonnet 4.5 through Opus 4.6 with automatic parameter selection) and openai-cua (GPT-5.4 Responses API with batched actions). Both share the same environment, recording, and evaluation infrastructure. Adding UI-TARS, FDM-1, or any other CUA model requires only implementing the agent loop. The agent just needs to consume screenshots and emit mouse/keyboard actions through DesktopInterface.

Non-GUI agents on GUI benchmarks: An interesting question: can a coding agent like Claude Code solve OSWorld tasks without using the GUI at all? Many tasks (spreadsheet manipulation, file management, application configuration) can be accomplished through CLIs and scripting. Running Claude Code or OpenHands against OSWorld's GUI tasks through terminal-only interaction would measure how much of "computer use" actually requires visual understanding vs. just knowing the right commands.

Other benchmarks: WebArena (812 web tasks), BearCubs, and WebChoreArena all evaluate computer use in browser-specific contexts. The same QEMU and Daytona infrastructure can host these. The adapter layer converts benchmark-specific task formats into Harbor's task model, and the DesktopInterface already handles browser interaction through the same screenshot/click/type primitives.

RL optimization: The real purpose of running benchmarks at scale isn't just to measure scores. It's to generate training signal. Harbor's trajectory format (ATIF) captures everything needed for offline RL: state (screenshot), action (mouse/keyboard event), reward (binary from verifier), and next state. With thousands of trajectories from parallel execution, you have a dataset for training computer use policies through methods like DPO, GRPO, or filtered behavioral cloning. The infrastructure described here is the data generation pipeline for that loop.
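Converting a trajectory into transitions is mechanical. A sketch with assumed field names ("screenshot", "action" are placeholders, not the actual ATIF schema), crediting the binary verifier reward to the final step:

```python
# Flatten a trajectory into (state, action, reward, next_state) tuples for
# offline RL. Field names here are assumptions, not the actual ATIF schema.
def to_transitions(steps: list, final_reward: float) -> list:
    transitions = []
    for i, step in enumerate(steps):
        last = i == len(steps) - 1
        reward = final_reward if last else 0.0          # sparse terminal reward
        next_state = None if last else steps[i + 1]["screenshot"]
        transitions.append((step["screenshot"], step["action"], reward, next_state))
    return transitions
```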

© Marco Mascorro. Built using Pelican. Theme by Giulio Fidente on github.