April 1, 2026

What OSWorld Is and How AI Models Are Rated on It

Most AI benchmarks test language. They ask a model to answer a question, summarize a passage, or solve a math problem in a text box. OSWorld tests something different: whether an AI can actually use a computer.

What OSWorld is

OSWorld is a benchmark introduced in 2024 that evaluates AI agents on real computer tasks. Instead of presenting a model with a prompt and expecting a text response, it drops the model into a live desktop environment — running Ubuntu, Windows, or macOS — and assigns it a task. The model sees the screen as a stream of screenshots and must interact with it using simulated mouse clicks, keyboard inputs, and navigation, just as a human would.
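That observe-then-act loop can be sketched in a few lines. Everything below is illustrative, not OSWorld's actual API: the `MockEnv` class, the action tuples, and the `choose_action` callback are invented stand-ins for a live VM that would return real screenshots and accept real mouse and keyboard events.

```python
# Minimal sketch of a screenshot-in, action-out agent loop.
# All names (MockEnv, choose_action, action tuples) are hypothetical,
# not OSWorld's real interface.

def run_episode(env, choose_action, max_steps=15):
    """Drive a mock desktop env until the agent signals it is done."""
    obs = env.screenshot()                 # the agent only sees the screen
    for _ in range(max_steps):
        action = choose_action(obs)        # e.g. ("click", x, y) or ("type", text)
        if action == ("done",):
            break
        obs = env.step(action)             # apply action, observe new screen
    return env.state


class MockEnv:
    """Toy stand-in for a live desktop: state is just a dict."""
    def __init__(self):
        self.state = {"form_filled": False}

    def screenshot(self):
        return dict(self.state)            # a real env would return pixels

    def step(self, action):
        if action[0] == "type":
            self.state["form_filled"] = True
        return dict(self.state)
```

A trivial agent for this mock environment would be `lambda obs: ("done",) if obs["form_filled"] else ("type", "hello")`; the point is the shape of the loop, not the policy.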

The tasks span everyday computer use: opening a spreadsheet and computing a formula, finding a file in a directory and attaching it to an email, editing a configuration in a settings panel, or navigating a browser to complete a form. The benchmark comprises 369 distinct tasks drawn from real applications, including Chrome, LibreOffice, VS Code, and GIMP.

The benchmark was designed to expose a gap that language-only evaluations miss entirely: the ability to perceive a visual interface, plan a sequence of actions, and execute them correctly in a dynamic environment.

How models are rated

Scoring in OSWorld is based on task completion, not process. After a model attempts a task, an automated function checks whether the final state of the computer matches what was required. Did the file get saved to the right location? Is the correct value in the correct cell? Was the email sent? There is no partial credit. Either the task succeeded or it did not.
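An outcome check of this kind can be sketched as a predicate over the machine's final state. The sketch below is hypothetical: the `final_state` dict and its keys stand in for real VM inspection (reading files, parsing documents), but the scoring logic mirrors what the article describes: binary, with no partial credit.

```python
# Sketch of outcome-based binary scoring: inspect the final state and
# return 1.0 on success, 0.0 otherwise. The final_state dict and its
# keys are invented stand-ins for real VM inspection.

def evaluate_spreadsheet_task(final_state, cell="B2", expected=42):
    """Full credit only if the required cell holds the required value."""
    cells = final_state.get("spreadsheet", {})
    return 1.0 if cells.get(cell) == expected else 0.0
```

Note what binary scoring implies: a sheet where every other cell is correct but `B2` is wrong still scores 0.0.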

This makes OSWorld harder to game than benchmarks where a model can produce a plausible-sounding answer without actually solving the problem. Because the environment is real and the evaluation checks real outcomes, the model has to plan correctly, recover from mistakes mid-task, and reach the right end state. Generating confident text is not enough.

Human testers complete around 72% of OSWorld tasks successfully, which serves as the practical ceiling for the benchmark.

How AI has progressed on it

When OSWorld launched, the best available models scored in the low single digits. GPT-4V, one of the strongest vision-capable models at the time, completed only around 5% of tasks. The gap relative to human performance was striking, and it made clear that strong language and reasoning ability does not automatically transfer to GUI-based computer use.

Progress since then has been substantial. By late 2024, models like Claude 3.5 Sonnet with computer use capabilities were reaching around 22% on the benchmark. Specialized agent systems built on top of frontier models pushed scores further, with top systems now clearing roughly 60 to 75% of tasks as of 2026. That is a significant improvement over a short period, driven by better visual grounding, smarter action planning, and agent scaffolding that allows models to recover from errors mid-task.

Why it matters as a benchmark

OSWorld matters because it shifts the frame for what it means for AI to be capable. A model that can write a persuasive essay or solve a logic puzzle has demonstrated narrow skills. A model that can sit down at a computer and get things done is demonstrating something closer to general usefulness.

The benchmark also illustrates something important about how AI progress gets measured. Good evaluations are hard to inflate. Because OSWorld checks actual outcomes in a real environment, improvements in scores correspond to real improvements in capability. A model that goes from 5% to 45% is not just better at predicting what answer a human would give — it is actually completing nearly ten times as many computer tasks correctly.
