April 1, 2026

What OSWorld Is and How AI Models Are Rated on It

Most AI benchmarks test language. They ask a model to answer a question, summarize a passage, or solve a math problem in a text box. OSWorld tests something different: whether an AI can actually use a computer.

What OSWorld is

OSWorld is a benchmark introduced in 2024 that evaluates AI agents on real computer tasks. Instead of presenting a model with a prompt and expecting a text response, it drops the model into a live desktop environment — running Ubuntu, Windows, or macOS — and assigns it a task. The model sees the screen as a stream of screenshots and must interact with it using simulated mouse clicks, keyboard inputs, and navigation, just as a human would.

The tasks span everyday computer use: opening a spreadsheet and computing a formula, finding a file in a directory and attaching it to an email, editing a configuration in a settings panel, or navigating a browser to complete a form. These are drawn from real applications including Chrome, LibreOffice, VS Code, and GIMP, covering 369 distinct tasks across a range of application categories.

The benchmark was designed by researchers to expose a gap that language-only evaluations miss entirely: the ability to perceive a visual interface, plan a sequence of actions, and execute them correctly in a dynamic environment.

How models are rated

Scoring in OSWorld is based on task completion, not process. After a model attempts a task, an automated function checks whether the final state of the computer matches what was required. Did the file get saved to the right location? Is the correct value in the correct cell? Was the email sent? There is no partial credit. Either the task succeeded or it did not.
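The outcome-based scoring can be pictured with a toy sketch. This is not the actual OSWorld harness; the task, file path, and evaluator below are made up for illustration, but the pattern is the same: read the final state of the environment and return pass or fail, with no partial credit.

```python
from pathlib import Path
import tempfile

def check_file_saved(expected_path, expected_text):
    # Hypothetical evaluator: pass only if the agent saved the right
    # content to the right location. Binary outcome, no partial credit.
    p = Path(expected_path)
    return p.exists() and p.read_text().strip() == expected_text

# Simulate an agent that completed the task, then score the final state.
target = Path(tempfile.gettempdir()) / "report.txt"
target.write_text("Q1 revenue: 42000\n")
print(check_file_saved(target, "Q1 revenue: 42000"))  # True
```

A half-finished attempt (file in the wrong place, or with the wrong contents) scores exactly the same as no attempt at all.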

This makes OSWorld harder to game than benchmarks where a model can produce a plausible-sounding answer without actually solving the problem. Because the environment is real and the evaluation checks real outcomes, the model has to plan correctly, recover from mistakes mid-task, and reach the right end state. Generating confident text is not enough.

Human testers complete around 72% of OSWorld tasks successfully, which serves as the practical ceiling for the benchmark.

How AI has progressed on it

When OSWorld launched, the best available models scored in the low single digits. GPT-4V, one of the strongest vision-capable models at the time, completed only around 5% of tasks. The gap relative to human performance was striking, and it made clear that strong language and reasoning ability does not automatically transfer to GUI-based computer use.

Progress since then has been substantial. By late 2024, models like Claude 3.5 Sonnet with computer use capabilities were reaching around 22% on the benchmark. Specialized agent systems built on top of frontier models pushed scores further, with top systems now clearing roughly 60 to 75% of tasks as of 2026. That is a significant improvement over a short period, driven by better visual grounding, smarter action planning, and agent scaffolding that allows models to recover from errors mid-task.

Why it matters as a benchmark

OSWorld matters because it shifts the frame for what it means for AI to be capable. A model that can write a persuasive essay or solve a logic puzzle has demonstrated narrow skills. A model that can sit down at a computer and get things done is demonstrating something closer to general usefulness.

The benchmark also illustrates something important about how AI progress gets measured. Good evaluations are hard to inflate. Because OSWorld checks actual outcomes in a real environment, improvements in scores correspond to real improvements in capability. A model that goes from 5% to 45% is not just better at predicting what answer a human would give — it is actually completing nearly ten times as many computer tasks correctly.

March 19, 2026

Solid state chemistry studies how atoms, ions, or molecules are arranged in solids and how that arrangement determines a material’s properties. Rather than focusing only on chemical composition, it emphasizes the structure of the solid itself, especially the repeating three-dimensional patterns found in crystal lattices. Concepts like the unit cell, coordination number, and packing of particles describe how a solid is built at the atomic level, and what form the material takes when unit cells repeat and stack together.
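The idea of particle packing can be made concrete with a quick calculation. As a standard textbook example (not tied to any specific material discussed here), in a face-centered cubic (FCC) unit cell the atoms touch along the face diagonal, which fixes the fraction of the cell filled by atoms:

```python
import math

# Packing fraction of a face-centered cubic (FCC) unit cell.
# An FCC cell contains 4 atoms, and atoms touch along the face
# diagonal, so 4r = a * sqrt(2) fixes the edge length a.
r = 1.0                       # atomic radius (arbitrary units)
a = 4 * r / math.sqrt(2)      # edge length of the cubic unit cell
atom_volume = 4 * (4 / 3) * math.pi * r**3
cell_volume = a**3
print(atom_volume / cell_volume)  # pi / (3 * sqrt(2)), about 0.74
```

About 74% of the cell is filled, the densest possible packing of equal spheres, which is part of why close-packed metals tend to be dense and ductile.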

This field is important because the structure of a solid directly affects its behavior. The arrangement of particles can influence hardness, melting point, electrical conductivity, magnetism, and stability. For example, metals conduct electricity well because of their bonding and electron mobility, while ionic solids are often brittle because shifting the layers brings like charges next to each other. Even small crystal defects, such as missing ions or substituted atoms, can change the color, conductivity, or strength of a material. Solid state chemistry therefore explains why solids with different internal structures can have very different physical properties.

March 17

A duopoly is a market where only two firms dominate and each firm’s decisions affect the other. In a duopoly, companies might choose strategies like setting a high price, lowering prices, increasing output, or limiting production. A payoff matrix helps show the profit each firm earns depending on what both firms choose.

What makes this important is that neither firm can make decisions independently. Each one has to think about how its rival will react. For example, if both firms keep prices high, they may both earn strong profits. But if one firm lowers its price while the other keeps prices high, the firm that cuts prices may attract more customers and earn more, while the other loses profit. If both firms cut prices, though, they may both end up worse off.

This is why duopolies are often studied in game theory. A payoff matrix can reveal whether firms have a dominant strategy and whether they end up at a Nash equilibrium, where neither firm wants to change its choice given the other firm’s decision. Overall, payoff matrices make it easier to see why competition in a duopoly can push firms toward outcomes that are not always best for either side.

Turns out that in this kind of game, assuming you do not know what action the other firm decides on, it is always better to defect and lower your prices: cutting price yields the higher payoff whether the rival prices high or low, which makes it a dominant strategy.
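A small payoff matrix makes this easy to check by brute force. The profit numbers below are made up, but they have the classic prisoner's-dilemma shape described above:

```python
# Hypothetical profits (in millions) for a two-firm pricing game.
# Each entry is (profit_A, profit_B) for that pair of strategies.
payoffs = {
    ("high", "high"): (10, 10),
    ("high", "low"):  (2, 14),
    ("low", "high"):  (14, 2),
    ("low", "low"):   (5, 5),
}
strategies = ["high", "low"]

def best_response_A(b):
    # Firm A's most profitable reply to firm B playing b.
    return max(strategies, key=lambda a: payoffs[(a, b)][0])

def best_response_B(a):
    # Firm B's most profitable reply to firm A playing a.
    return max(strategies, key=lambda b: payoffs[(a, b)][1])

# A Nash equilibrium is a pair where each strategy is a best
# response to the other firm's choice.
nash = [(a, b) for a in strategies for b in strategies
        if best_response_A(b) == a and best_response_B(a) == b]
print(nash)  # [('low', 'low')]
```

Both firms end up at (low, low), the unique Nash equilibrium, even though (high, high) would leave each of them better off.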

March 16

Kinetics is the study of the rates of chemical processes. There are three common reaction orders: zeroth, first, and second. Second-order reactions can in turn be split into those with a single reactant and those with two reactants.

The functional rate laws for these orders are found by integrating the differential rate laws, which describe the rate of disappearance of reactant A over time. The result is called the integrated rate law.

The integrated rate laws for zeroth-, first-, and second-order (single reactant) reactions are the following:

Zeroth order: [A] = [A]0 − kt
First order: ln[A] = ln[A]0 − kt
Second order: 1/[A] = 1/[A]0 + kt

The rate constant k is one of the few chemical constants whose units vary from case to case (M/s for zeroth order, 1/s for first, 1/(M·s) for second). These rate laws can be used to find the half-life of various reactants in a reaction, calculate the theoretical product yield at a certain time point, and even determine the rate-determining step of a process with various intermediates.
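As a quick sketch (with a made-up rate constant and starting concentration), the three integrated rate laws can be coded directly, and the first-order half-life t1/2 = ln 2 / k drops right out of them:

```python
import math

def conc_zeroth(A0, k, t):
    # Zeroth order: [A] = [A]0 - k*t (clamped at zero once A runs out)
    return max(A0 - k * t, 0.0)

def conc_first(A0, k, t):
    # First order: ln[A] = ln[A]0 - k*t, i.e. [A] = [A]0 * exp(-k*t)
    return A0 * math.exp(-k * t)

def conc_second(A0, k, t):
    # Second order (single reactant): 1/[A] = 1/[A]0 + k*t
    return 1.0 / (1.0 / A0 + k * t)

def half_life_first(k):
    # First-order half-life is independent of [A]0: t1/2 = ln 2 / k
    return math.log(2) / k

k = 0.05   # 1/s, hypothetical first-order rate constant
A0 = 1.0   # mol/L, hypothetical starting concentration
print(conc_first(A0, k, half_life_first(k)))  # ~0.5, half of A0
```

Plugging the half-life back into the first-order law returns exactly half the starting concentration, which is a handy sanity check on the algebra.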

March 14, 2026

Neuroscience is surprisingly connected to mathematics. In many neuroscience studies, you record thousands of data points all at once. To find patterns within this seemingly sporadic data, neuroscientists use PCA (Principal Component Analysis).

The way PCA works is by computing the eigenvectors of a covariance matrix. Eigenvectors are special vectors that a matrix only stretches or shrinks, without changing their direction. The degree to which a vector is stretched or shrunk is called its eigenvalue, denoted by the Greek letter lambda (λ). Eigenvectors reveal the natural coordinate system of a dataset: they point in the directions along which the data varies most. The details of how PCA uses eigenvectors to analyze neuroscience data are a bit beyond my understanding, but there always exists an interesting connection between the life sciences, mathematics, and computational statistics.
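A minimal sketch of the idea, using a small simulated recording rather than real neural data (the signal sizes and noise level are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated "recording": 500 time points of 3 correlated channels,
# built so most variance lies along one hidden direction.
latent = rng.normal(size=(500, 1))
data = latent @ np.array([[2.0, 1.0, 0.5]]) + 0.1 * rng.normal(size=(500, 3))

# Center the data and compute its covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# Eigenvectors of the covariance matrix are the principal components;
# the eigenvalues (lambda) give the variance captured by each one.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort largest-variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals[0] / eigvals.sum())  # fraction of variance on component 1
```

Because the three channels were driven by one shared signal, nearly all the variance collapses onto the first component: the "natural coordinate system" the entry describes.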

March 13, 2026 (a lil late)

Firms are business entities that produce goods using raw materials, capital, and labor. Capital is considered a fixed cost, since it covers everything bought and secured at the start of a firm’s production. The cost of raw materials increases in direct proportion to the amount of goods produced. However, by the nature of diminishing marginal returns, as firms invest more and more labor into the production of a good, each additional worker adds less output, so the marginal cost of producing each extra unit increases.

To portray these effects and the average variable cost, average fixed cost, and average total cost of a production line, a cost curve graph is used.

On the graph, the MC curve first decreases and then begins to increase. Although the marginal cost is rising, the average variable cost and average total cost still decrease as long as the marginal cost is below those values. Once the marginal cost passes the AVC and ATC, those values start to rise as well. As a result, the point at which the MC curve intersects the AVC and ATC curves is the minimum point of each of those curves. Finally, the average fixed cost curve is constantly decreasing, which is logical since it depicts a fixed cost divided by an increasing number of products.

One final thing to note about the graph is that the ATC curve is the sum of the AVC and AFC curves. Therefore, as the number of products increases, the AVC and ATC curves grow ever closer to each other. However, the permanently decreasing, yet always positive, average fixed cost prevents the two curves from ever intersecting.
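These relationships can be checked numerically. The cost function below is invented for illustration, but it is cubic in output, which produces the usual U-shaped MC, AVC, and ATC curves:

```python
# Hypothetical firm: a fixed cost plus a cubic variable cost, chosen
# only so the resulting curves have the standard textbook shapes.
FIXED = 50.0

def variable_cost(q):
    return 0.1 * q**3 - 1.5 * q**2 + 10 * q

def total_cost(q):
    return FIXED + variable_cost(q)

for q in range(1, 13):
    mc = total_cost(q) - total_cost(q - 1)   # cost of the q-th unit
    avc = variable_cost(q) / q
    afc = FIXED / q                          # falls as q grows
    atc = total_cost(q) / q                  # ATC = AVC + AFC
    print(f"q={q:2d}  MC={mc:6.2f}  AVC={avc:6.2f}  AFC={afc:6.2f}  ATC={atc:6.2f}")
```

In the printed table, MC dips and then rises, crossing AVC and then ATC near each curve's minimum, while AFC falls throughout and ATC equals AVC plus AFC at every quantity.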

Such is how modern economics graphically lays out the cost of a firm producing a certain good.