
Integrating Computer-Use (OSWorld) into Harbor

Full Project repo link (Harbor fork).

Harbor is a framework (created by the Terminal-Bench team) for evaluating and optimizing LLMs and AI agents at scale across a variety of benchmarks. It supports creating benchmarks and environments and generating rollouts for reinforcement learning optimization. Harbor can run any agent (including Claude Code, OpenHands, Aider, Codex CLI, and others) against task suites such as SWE-Bench, Terminal-Bench, and Aider Polyglot, managing thousands of parallel sandbox environments.

At its core, Harbor’s abstraction is simple: you define a task (an instruction, an environment, and a verifier), point an agent at it, and collect the resulting trajectories and rewards, at any scale, concurrently. Harbor takes care of the environment lifecycle, parallel task execution, trajectory recording (in Harbor’s Agent Trajectory Interchange Format, or ATIF), and metric aggregation.

Until now, Harbor has supported only text-based benchmarks: coding tasks where agents interact via the terminal or file-system APIs, typically in headless Linux environments using Docker images or cloud sandboxes. This approach works well for command-line and programming evaluations, but doesn’t address a crucial class of capabilities: computer-use agents, models that interact with full desktop GUIs, using screenshots, mouse movements, and keyboard input just as humans do.

This post describes the integration of OSWorld into Harbor, enabling evaluation of computer use agents on real Ubuntu and Windows desktops running in Daytona cloud sandboxes and QEMU/KVM VMs on bare-metal servers.

Computer-use and knowledge-work benchmarks are becoming more relevant now

Computer use is the current frontier for foundation models. The latest releases from Anthropic (Opus 4.6) and OpenAI (GPT-5.4) treat computer use as a first-class capability, featuring it at the top of their model cards. OSWorld numbers (as of March 6th, 2026):

GPT-5.4, Sonnet 4.6, and Opus 4.6 crossed the human baseline (72.4%) on OSWorld-verified. These aren't marginal improvements. The gap between GPT-5.2 and GPT-5.4 on OSWorld is 28 points in a single generation.

Beyond screenshot-and-click models, Standard Intelligence's FDM-1 (Feb 2026) takes a different approach: trained on 11 million hours of screen recordings, it processes nearly 2 hours of continuous video in a single session at 30 FPS. Their video encoder compresses this into ~1M tokens, 50–100x more efficient than VLM-based approaches. FDM-1 handles continuous motion (scrolling, dragging, 3D manipulation) that screenshot-based agents fundamentally can't, and was fine-tuned for autonomous vehicle control with under an hour of driving footage.

Many tasks can be solved with a terminal-only approach (no GUI/desktop needed), as in Terminal-Bench, but a capable model shouldn't be limited to terminal-only interaction. It should be able to execute any task regardless of the input-output modality.

OSWorld Task Breakdown

OSWorld contains 418 tasks across real desktop environments. Each task gives an agent a natural language instruction and a live desktop. The agent must complete the task using only screenshots and mouse/keyboard input.

Ubuntu: 369 tasks across 10 categories

Windows: 49 tasks across 4 categories

Ubuntu tasks cover browser automation, image editing, office productivity, system administration, email, media playback, and code editing. Windows tasks focus on Office applications and cross-application workflows. Every task includes a programmatic evaluator: not LLM-as-judge, but deterministic comparison of file contents, UI state, or application output.

Technical Implementation

System Architecture

The Adapter: OSWorld → Harbor Tasks

The adapter (adapters/osworld/adapter.py) reads OSWorld's task index (test_all.json for Ubuntu, test_windows.json for Windows) and generates one Harbor task directory per task:

win_excel__3aaa4e37-dc91-482e-99af-132a612d40f3/
├── task.toml            # timeouts, resource limits, os_type
├── instruction.md       # natural language task for the agent
├── environment/
│   └── Dockerfile       # unused for QEMU/Daytona, kept for compatibility
└── tests/
    ├── test.py          # runs eval_runner, writes reward.txt
    └── task_config.json # OSWorld evaluator config (metric, getter, expected)
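For a sense of what the task metadata looks like, a task.toml might read roughly as follows (the field names here are illustrative assumptions, not Harbor's exact schema):

```toml
# Hypothetical sketch -- key names are assumptions, not Harbor's actual schema
[task]
os_type = "windows"

[limits]
agent_timeout_sec = 1800   # wall-clock budget for the agent loop
memory_mb = 4096           # VM / sandbox memory
```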

task_config.json carries the original OSWorld evaluation specification: which metric function to use (compare_csv, compare_table, exact_match, etc.), how to extract the current state from the desktop (getter), and the expected result. The test script invokes the eval runner inside the VM, which resolves the metric, runs the getter, compares, and writes a binary score (0 or 1) to /logs/verifier/reward.txt.
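The dispatch step can be sketched in a few lines. This is a hypothetical simplification: the real metric and getter implementations live in OSWorld's desktop_env package, and the metric table here contains only a toy `exact_match`.

```python
# Hypothetical sketch of the eval-runner flow: resolve metric, compare the
# extracted state against the expected result, write a binary reward.
import json

METRICS = {
    "exact_match": lambda actual, expected: 1 if actual == expected else 0,
}

def run_eval(config_path: str, current_state: str, reward_path: str) -> int:
    with open(config_path) as f:
        config = json.load(f)
    metric = METRICS[config["metric"]]     # e.g. "exact_match"
    score = metric(current_state, config["expected"])
    with open(reward_path, "w") as f:      # /logs/verifier/reward.txt in Harbor
        f.write(str(score))
    return score
```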

The Agents

I built two simple agents to test the end-to-end integration: anthropic-cua and openai-cua. Both implement the same loop: take a screenshot, send it to the model, execute the returned actions. They differ only in API mechanics.

anthropic-cua uses Claude's Computer Use beta. The agent auto-detects the model tier and adjusts parameters:

Opus 4.6 introduced adaptive thinking (the model decides when to think, controlled by effort instead of a fixed budget_tokens), the zoom action for inspecting screen regions at full resolution, and the computer_20251124 tool version. The system prompt includes self-verification guidance per Anthropic's docs: "after each step, take a screenshot and evaluate the outcome."

while not done:
    screenshot = await desktop.take_screenshot()
    response = client.beta.messages.create(
        model=model,
        messages=[..., screenshot],
        tools=[computer_use_tool],
        betas=[beta_flag],
        thinking={"type": "adaptive"},     # Opus 4.6 only
        output_config={"effort": "high"},  # Opus 4.6 only
    )
    for action in response.content:
        match action.type:
            case "click":      await desktop.mouse_click(x, y, button)
            case "type":       await desktop.keyboard_type(text)
            case "key":        await desktop.keyboard_press(key)
            case "scroll":     await desktop.mouse_scroll(x, y, delta)
            case "screenshot": pass  # next iteration takes fresh screenshot
            case "wait":       await asyncio.sleep(seconds)

openai-cua uses GPT-5.4's Responses API with the GA computer tool. Key differences:

  • client.responses.create() with previous_response_id for conversation continuity, removing the need for manual message history management.
  • GPT-5.4 returns batched actions (multiple actions per turn). The harness executes all actions in order, then takes one screenshot for the next turn.
  • reasoning: {"effort": "high"} on every turn for consistent deep reasoning.
  • Handles pending_safety_checks by acknowledging them in the next computer_call_output.
  • System prompt grants explicit "pre-approval" for password entry to prevent safety-check pauses.
response = client.responses.create(
    model="gpt-5.4", input=[user_msg, screenshot], tools=[{"type": "computer"}],
    reasoning={"effort": "high"}, truncation="auto",
)
while has_computer_call(response):
    actions = response.output[call].actions  # batched actions
    for action in actions:
        match action["type"]:
            case "click":    await desktop.mouse_click(action["x"], action["y"])
            case "type":     await desktop.keyboard_type(action["text"])
            case "keypress": await desktop.keyboard_press(action["keys"])
            case "scroll":   await desktop.mouse_scroll(action["x"], action["y"], ...)
    screenshot = await desktop.take_screenshot()
    response = client.responses.create(
        previous_response_id=response.id,
        input=[{"type": "computer_call_output", "call_id": call_id, "output": screenshot}],
    )

Both agents compress screenshots to JPEG (quality 60) before sending, to keep the context from growing without bound. The actual image format is detected from magic bytes, so each API receives the correct media type.
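The magic-byte check is simple. A minimal version (the repo's actual helper may be named and structured differently):

```python
# Detect a screenshot's media type from its magic bytes, so each API gets
# the correct content type even after JPEG re-encoding.
def detect_media_type(data: bytes) -> str:
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    return "application/octet-stream"
```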

DesktopInterface: The Agent-Environment Contract

Both agents are decoupled from the environment through DesktopInterface. This is the only API an agent needs to implement a CUA loop.
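A minimal sketch of that contract, using the method names that appear in the agent loops above (the real interface in the repo may carry extra parameters and methods, e.g. for recording):

```python
# Sketch of the DesktopInterface contract. Method names mirror the calls in
# the agent loops shown earlier; exact signatures in the repo may differ.
from abc import ABC, abstractmethod

class DesktopInterface(ABC):
    @abstractmethod
    async def take_screenshot(self) -> bytes: ...

    @abstractmethod
    async def mouse_click(self, x: int, y: int, button: str = "left") -> None: ...

    @abstractmethod
    async def keyboard_type(self, text: str) -> None: ...

    @abstractmethod
    async def keyboard_press(self, key: str) -> None: ...

    @abstractmethod
    async def mouse_scroll(self, x: int, y: int, delta: int) -> None: ...
```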

To add a new CUA agent (e.g. UI-TARS, FDM-1): implement BaseAgent, register in AgentName enum and factory.py, and call DesktopInterface methods in your loop. The agent never needs to know whether it's running on QEMU or Daytona.

Daytona: Cloud Sandbox Execution

For Daytona, the environment creates an ephemeral sandbox per trial from a base snapshot (ubuntu-large for Ubuntu, windows-base for Windows). The DesktopInterface wraps Daytona's computer_use SDK:

screenshot_bytes = await sandbox.computer_use.screenshot.capture()
await sandbox.computer_use.mouse.click(x, y, button="left")
await sandbox.computer_use.keyboard.type(text)
await sandbox.computer_use.recording.start()

This is the same DesktopInterface API the agent calls. The agent doesn't know or care whether it's talking to a local QEMU VM or a Daytona sandbox in the cloud.

For Ubuntu, a shell setup script installs applications and packages at sandbox creation (~2-5 min). For Windows, a Python setup script installs evaluation packages (openpyxl, pandas, lxml, etc.) and ffmpeg, and Harbor deploys the OSWorld desktop_env evaluators with safe import wrappers that gracefully skip heavy dependencies like easyocr or librosa that aren't needed for most task categories.
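The "safe import wrapper" idea can be sketched like this (the helper name is hypothetical; the wrappers actually deployed may be structured differently):

```python
# Optional heavy dependencies (easyocr, librosa, ...) resolve to None instead
# of crashing evaluator import on sandboxes that don't need them.
import importlib

def optional_import(name: str):
    try:
        return importlib.import_module(name)
    except ImportError:
        return None

easyocr = optional_import("easyocr")   # None if not installed
# Evaluators then guard usage: if easyocr is None, skip OCR-based checks.
```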

QEMU/KVM: Bare-Metal Execution

Each trial boots a QEMU VM from a copy-on-write overlay on a pre-baked base image:

qemu-img create -f qcow2 -b ubuntu.qcow2 -F qcow2 overlay_trial_xyz.qcow2
qemu-system-x86_64 -enable-kvm -cpu host -m 4G -smp 1 \
    -drive file=overlay_trial_xyz.qcow2 -nographic \
    -net user,hostfwd=tcp::15000-:5000

The VM runs a Flask HTTP server on port 5000 with two endpoints: /screenshot (returns PNG) and /execute (runs a command, returns JSON with stdout/stderr/returncode). On Ubuntu, xdotool handles mouse and keyboard input. On Windows, pyautogui does the same through the execute endpoint. Screen recording uses ffmpeg with x11grab (Ubuntu) or gdigrab (Windows).
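The core of the /execute endpoint is just a subprocess wrapper. A stand-in sketch of that handler logic, outside any web framework (the in-VM server's actual code may differ):

```python
# Core logic of the hypothetical /execute handler: run a shell command and
# return stdout/stderr/returncode as a JSON-serializable dict.
import subprocess

def execute_handler(payload: dict) -> dict:
    proc = subprocess.run(
        payload["command"],
        shell=True,
        capture_output=True,
        text=True,
        timeout=payload.get("timeout", 60),
    )
    return {
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "returncode": proc.returncode,
    }
```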

The bake step (scripts/osworld/bake-qcow2.sh) is critical: it boots the base VM once, installs all evaluator dependencies (the desktop_env package from OSWorld, Playwright with Chromium, xdotool, Python packages), configures applications (Chrome remote debugging, VLC HTTP interface, LibreOffice save formats), and commits changes back to the qcow2. All subsequent COW overlays inherit these dependencies without re-installing.

Concurrent trials each get their own overlay and unique port allocation, so 20–50 VMs can run simultaneously on a single bare-metal server without interfering with each other.

Trajectory Recording and Viewer

Every trial produces:

  • ATIF trajectory (trajectory.json): structured record of every agent step: screenshot, model response, action, timing, token usage
  • Screen recording (recording.mp4): continuous video of the desktop during the trial
  • Agent logs (trial.log): environment setup, task setup steps, errors, verifier output
  • Verifier output (verifier/output.txt, verifier/reward.txt): evaluator logs and the final binary score

Harbor's built-in viewer (harbor view) serves a web UI that renders trajectories with step-by-step screenshots, action overlays, token usage breakdowns, recording playback, and side-by-side comparison across trials. This is essential for debugging agent failures: you can see exactly which screenshot the model misinterpreted, which click missed its target, or where the agent entered an unrecoverable loop.

Running It

Replace anthropic-cua with openai-cua in any example to use GPT-5.4 instead. Add --model anthropic/claude-opus-4-6 to use Opus 4.6.

Single Ubuntu task on Daytona

# Claude (default: Sonnet 4.5)
harbor run --config examples/configs/osworld-daytona-job.yaml \
    --path ~/.harbor/data/osworld/tasks \
    -t os__94d95f96-9699-4208-98ba-3c3119edf9c2

# GPT-5.4
harbor run --config examples/configs/osworld-daytona-job.yaml \
    --path ~/.harbor/data/osworld/tasks \
    -t os__94d95f96-9699-4208-98ba-3c3119edf9c2 \
    --agent openai-cua

# Claude Opus 4.6
harbor run --config examples/configs/osworld-daytona-job.yaml \
    --path ~/.harbor/data/osworld/tasks \
    -t os__94d95f96-9699-4208-98ba-3c3119edf9c2 \
    --agent anthropic-cua --model anthropic/claude-opus-4-6

Single Windows task on Daytona

harbor run --config examples/configs/osworld-windows-daytona-job.yaml \
    --path ~/.harbor/data/osworld/tasks_windows \
    -t win_excel__3aaa4e37-dc91-482e-99af-132a612d40f3 \
    --agent openai-cua

Full benchmark on Daytona (369 Ubuntu + 49 Windows)

# Ubuntu - 10 concurrent sandboxes
harbor run --config examples/configs/osworld-daytona-job.yaml \
    --path ~/.harbor/data/osworld/tasks \
    --n-concurrent 10 --agent anthropic-cua

# Windows - 4 concurrent sandboxes
harbor run --config examples/configs/osworld-windows-daytona-job.yaml \
    --path ~/.harbor/data/osworld/tasks_windows \
    --n-concurrent 4 --agent openai-cua

Full benchmark on QEMU (bare-metal)

# Ubuntu - 20 concurrent VMs
harbor run --path ~/.harbor/data/osworld/tasks \
    --n-concurrent 20 --agent anthropic-cua --env qemu

# Windows - 10 concurrent VMs
harbor run --path ~/.harbor/data/osworld/tasks_windows \
    --n-concurrent 10 --agent openai-cua --env qemu

View results

harbor view --host 0.0.0.0 -p 8080 jobs/

Future Work

Speed: The current bottleneck is environment setup time, not agent execution. On Daytona Windows, pip-installing evaluation packages takes ~4 minutes per sandbox. Pre-baked snapshots would eliminate this entirely. On QEMU, Windows VMs take ~60 seconds to boot vs ~20 seconds for Ubuntu. UEFI firmware initialization is the culprit.

Other CUA agents: The DesktopInterface is agent-agnostic. Two agents are integrated: anthropic-cua (Claude Computer Use, supporting Sonnet 4.5 through Opus 4.6 with automatic parameter selection) and openai-cua (GPT-5.4 Responses API with batched actions). Both share the same environment, recording, and evaluation infrastructure. Adding UI-TARS, FDM-1, or any other CUA model requires only implementing the agent loop. The agent just needs to consume screenshots and emit mouse/keyboard actions through DesktopInterface.

Non-GUI agents on GUI benchmarks: An interesting question: can a coding agent like Claude Code solve OSWorld tasks without using the GUI at all? Many tasks (spreadsheet manipulation, file management, application configuration) can be accomplished through CLIs and scripting. Running Claude Code or OpenHands against OSWorld's GUI tasks through terminal-only interaction would measure how much of "computer use" actually requires visual understanding vs. just knowing the right commands.

Other benchmarks: WebArena (812 web tasks), BearCubs, and WebChoreArena all evaluate computer use in browser-specific contexts. The same QEMU and Daytona infrastructure can host these. The adapter layer converts benchmark-specific task formats into Harbor's task model, and the DesktopInterface already handles browser interaction through the same screenshot/click/type primitives.

RL optimization: The real purpose of running benchmarks at scale isn't just to measure scores. It's to generate training signal. Harbor's trajectory format (ATIF) captures everything needed for offline RL: state (screenshot), action (mouse/keyboard event), reward (binary from verifier), and next state. With thousands of trajectories from parallel execution, you have a dataset for training computer use policies through methods like DPO, GRPO, or filtered behavioral cloning. The infrastructure described here is the data generation pipeline for that loop.
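Converting a trajectory into transitions is mechanical. A sketch with assumed field names ("screenshot", "action" are placeholders, not the actual ATIF schema), crediting the binary verifier reward to the final step:

```python
# Flatten a trajectory into (state, action, reward, next_state) tuples for
# offline RL. Field names here are assumptions, not the actual ATIF schema.
def to_transitions(steps: list, final_reward: float) -> list:
    transitions = []
    for i, step in enumerate(steps):
        last = i == len(steps) - 1
        reward = final_reward if last else 0.0          # sparse terminal reward
        next_state = None if last else steps[i + 1]["screenshot"]
        transitions.append((step["screenshot"], step["action"], reward, next_state))
    return transitions
```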

© Marco Mascorro. Built using Pelican. Theme by Giulio Fidente on github.