Research

How we used Nen to further AI research

Nen Team
May 5, 2026

Two weeks ago, the CRUX (Collaborative Research for Updating AI eXpectations) project (link), led by researchers from Princeton University, ran its inaugural experiment, in which an AI agent built and published an app to the iOS App Store end-to-end with only one human intervention. It showed that frontier agents could autonomously ship a real app to a real app store.

CRUX-Windows is the follow-up: could an agent do the same on Windows?

Claude Opus 4.7, given a developer account and a one-page brief, built a time-zone overlap app and shipped it live (link) to the Microsoft Store through a Nen desktop. The agent handled everything from code to submission in 78 hours of wall-clock time, with three human inputs.

This post is about the platform that made the run possible without turning the evaluation into a Windows infrastructure exercise.

A laptop on a desk does not scale to parallel agent runs

The Microsoft Store toolchain, including Visual Studio, the Windows App SDK, MSIX packaging, and the Partner Center submission flow, only runs on Windows. The naive way to handle that is to buy a Windows laptop, set it on a desk, and let the agent drive it. CRUX #1, the predecessor experiment, did the equivalent on macOS with a Mac Mini. But one physical machine gives you one experiment at a time, tied to one Windows build, with no clean snapshot or restore path.

Nen gave us something different. OpenClaw, our agent harness, ran in a sandbox and drove a Nen-hosted Windows VM through HTTP. The human experiment operator could check in on the Windows desktop through a browser tab, but otherwise did not intervene.

CRUX-Windows agent driving a Nen-hosted Windows VM

The agent-to-Windows connection stayed warm across every tool call, and the operator could walk away and resume watching at any time.

Nen exposes computer_tool on Windows VMs so your agent can drive them

Concretely, Nen runs as a Go binary on the controller, exposes an HTTP API on :8600, and presents a single uniform interface for "I have a Windows desktop, please click on it":

POST /actions?model=claude-opus-4-7&desktop=win
{"type":"computer_20250124","action":"left_click","coordinate":[500,300]}
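For readers who want to see the call from the harness side, here is a minimal Python sketch that builds the request above. The host (localhost) is an assumption, since the post only gives the port, and the helper name build_action_request is hypothetical. The sketch builds the request without sending it, because the response shape is not described here.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the controller is reachable on localhost; only the port (:8600)
# comes from the post.
BASE_URL = "http://localhost:8600"


def build_action_request(model: str, desktop: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /actions request for one tool call."""
    query = urllib.parse.urlencode({"model": model, "desktop": desktop})
    return urllib.request.Request(
        url=f"{BASE_URL}/actions?{query}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The same left-click payload shown in the post.
click = {"type": "computer_20250124", "action": "left_click", "coordinate": [500, 300]}
req = build_action_request("claude-opus-4-7", "win", click)

# To actually send it against a running controller:
#   with urllib.request.urlopen(req, timeout=30) as resp:
#       body = resp.read()
```

Keeping request construction separate from sending makes the call easy to log or replay during a long run.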

Behind that endpoint, Nen absorbs five things that otherwise become the user's problem.

The model adapter layer. Anthropic, OpenAI, and Google's Gemini each define "computer use" tool calls differently, and Nen parses the model's native format, executes the canonical action, and formats the result back in the shape the model expects. On the harness side, swapping Opus 4.7 for GPT-5 is a query-string flip; the request body is whatever tool call the model natively emits:

# Opus 4.7
POST /actions?model=claude-opus-4-7&desktop=win
{"type":"computer_20250124","action":"left_click","coordinate":[500,300]}

# GPT-5 (same action, Nen translates)
POST /actions?model=gpt-5&desktop=win
{"type":"computer_call","action":{"type":"click","x":500,"y":300}}
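To make the adapter idea concrete, here is a small Python sketch that normalizes the two payloads above into one canonical action. It illustrates the pattern and is not Nen's actual code: the CanonicalAction type, the parser names, and the rule that a plain "click" means a left click are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonicalAction:
    """Hypothetical model-agnostic action: what to do and where."""
    kind: str
    x: int
    y: int


def parse_anthropic(payload: dict) -> CanonicalAction:
    # Anthropic-style payload: action name plus an [x, y] coordinate pair.
    x, y = payload["coordinate"]
    return CanonicalAction(kind=payload["action"], x=x, y=y)


def parse_openai(payload: dict) -> CanonicalAction:
    # OpenAI-style payload: nested action object with separate x and y fields.
    action = payload["action"]
    # Assumption: a plain "click" corresponds to a left click.
    kind = "left_click" if action["type"] == "click" else action["type"]
    return CanonicalAction(kind=kind, x=action["x"], y=action["y"])


# Dispatch on the payload's top-level "type" field, as seen in the two examples.
PARSERS = {
    "computer_20250124": parse_anthropic,
    "computer_call": parse_openai,
}


def to_canonical(payload: dict) -> CanonicalAction:
    """Translate a model-native tool call into the canonical action."""
    try:
        parser = PARSERS[payload["type"]]
    except KeyError:
        raise ValueError(f"unsupported tool-call format: {payload.get('type')!r}")
    return parser(payload)


opus = {"type": "computer_20250124", "action": "left_click", "coordinate": [500, 300]}
gpt = {"type": "computer_call", "action": {"type": "click", "x": 500, "y": 300}}
same = to_canonical(opus) == to_canonical(gpt)
```

Both payloads collapse to the same canonical click, so everything downstream of the adapter only needs to understand one format.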

A desktop that's still there when you come back. Windows locks the screen on every RDP disconnect, and the next screenshot captures LogonUI.exe instead of your desktop. On top of that, every PowerShell command run through the default RDP session pops a visible console window in the framebuffer your agent is screenshotting, which leaves visual junk in the run and confuses the model. Nen keeps the RDP session open across tool calls so the desktop is there when the agent comes back, and runs shell commands through a separate channel so the screenshots stay clean. Kick off a 78-hour run, walk away, come back to a working desktop: that's Nen's promise.

RDP from a non-Windows controller. A Linux controller drives the VM over RDP, tunneled through guacd (Apache Guacamole's protocol daemon) instead of embedding FreeRDP. The viewer is a browser page streaming the same tunnel over WebSocket.

File transfer in both directions. Credentials go onto the VM, build logs come back off. Nen mounts a shared folder between host and guest, with permissions sorted so non-root processes can write to it, which saves you from rigging up SCP keys or an S3 sync for every run.

Simple lifecycle management. Use nen desktop up, nen desktop stop, and nen desktop destroy on Windows VMs. We reprovisioned the experiment VM twice during dry runs without rebuilding any of the harness on top.
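If you script the lifecycle from a harness, a thin wrapper is enough. The Python sketch below uses only the three subcommands named above. Any further arguments the CLI may take (a desktop name, for instance) are not covered in this post, so they are left out, and the helper names are hypothetical.

```python
import subprocess

# The three lifecycle subcommands mentioned in the post.
ACTIONS = ("up", "stop", "destroy")


def desktop_command(action: str) -> list[str]:
    """Return the argv for `nen desktop <action>`, rejecting unknown actions."""
    if action not in ACTIONS:
        raise ValueError(f"unknown lifecycle action: {action!r}")
    # Assumption: no extra arguments are needed; adjust for your setup.
    return ["nen", "desktop", action]


def run_desktop(action: str, dry_run: bool = True) -> list[str]:
    """Run the lifecycle command, or just return it when dry_run is set."""
    cmd = desktop_command(action)
    if not dry_run:
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, check=True)
    return cmd


# A reprovision is a destroy followed by an up.
reprovision = [run_desktop("destroy"), run_desktop("up")]
```

The dry-run default lets a harness log the exact commands before anything touches a live VM.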

All of this sounds like something you could build on top of EC2: spin up a Windows AMI, RDP in, point your agent at it. However, you would still need to build a model adapter for the Anthropic, OpenAI, and Gemini formats, an RDP keepalive so disconnects do not leave your screenshot staring at LogonUI.exe, a side channel for shell commands so PowerShell consoles stay out of the framebuffer, and the file-transfer and lifecycle tooling described above. Also, because Nen keeps a pre-warmed pool, time to first pixel is a matter of seconds. Your time is better spent building agents than maintaining a Windows automation stack.

We open-sourced the Windows primitive as dexbox because plumbing should not be a moat

The Windows desktop primitive underneath Nen is open source as dexbox. Same binary, same interfaces; Nen is the productized stack we run, host, and support on top of it. Two reasons we keep the primitive in the open:

  1. The interesting work in agentic computer use is happening at the model layer and the agent-scaffolding layer. We don't want every team trying to reproduce a CRUX-style experiment to first re-derive RDP session management.
  2. Research moves faster when the substrate is shared. CRUX-Windows reproduced one finding from CRUX #1 (the agent can ship an app) and surfaced a second one (the cost-optimization result was scaffolding-driven, not capability-driven). Both of those are easier for the next team to pick up if they don't have to rebuild the controller stack first.

The research got to just be research

With the infrastructure out of the way, the agent diagnosed an MSBuild namespace bug from a 6.3 MB log. It read a Microsoft Store policy rejection on branded tile icons, generated the missing assets, and resubmitted. It noticed its own SCP wrapper was corrupting transfers and rewrote it mid-run. Those are the moments worth studying. Zi's CRUX-Windows writeup covers them in detail. This post is about the 78 hours of nothing else getting in the agent's way.