Methodology

The Benchmark Matrix on our home page is a public assertion of what frontier model capability looks like right now. This page explains where those numbers come from, how we scale them, and what we do when the data is incomplete or changes shape. The short version is simple: every value is sourced, nothing is invented, and when we are uncertain we say so on the matrix itself rather than paper over the gap.

Where the numbers come from

Every score in the matrix is drawn from Artificial Analysis, an independent evaluator that runs frontier models against a fixed battery of public benchmarks. We refresh our copy twice a day and stamp each update with the time it was fetched, so the “Updated… ago” note beside the matrix reflects that fetch, not the moment you happened to load the page. We do not re-run the benchmarks ourselves, and we do not blend in scores from other leaderboards.
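The distinction between fetch time and page-load time can be shown with a minimal sketch. It assumes each refresh stores a timezone-aware UTC timestamp; the function name `updated_label` and the exact label wording are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timezone


def updated_label(fetched_at: datetime, now: datetime) -> str:
    """Describe the age of the stored fetch, independent of page-load time."""
    # Clamp at zero so a small clock skew never yields a negative age.
    seconds = max(0, int((now - fetched_at).total_seconds()))
    if seconds < 60:
        return "Updated just now"
    minutes = seconds // 60
    if minutes < 60:
        return f"Updated {minutes} min ago"
    hours = minutes // 60
    return f"Updated {hours} h ago"


# Illustrative timestamps only: a fetch at 06:00 UTC viewed at 09:30 UTC.
fetched = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)
viewed = datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc)
print(updated_label(fetched, viewed))  # Updated 3 h ago
```

Because the label is computed from the stored fetch timestamp, two visitors loading the page at the same moment see the same age, however long either has had the tab open.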

The six axes

The matrix tracks six capabilities, each reported on its native zero-to-one accuracy scale exactly as Artificial Analysis measures it. GPQA covers graduate-level question answering. HLE is Humanity’s Last Exam. SciCode measures scientific coding, and IFBench measures how faithfully a model follows instructions. The τ² benchmark probes agentic, tool-using behaviour, and Terminal-Bench Hard tests real work inside a command-line environment. A model scoring 0.62 on an axis answered roughly sixty-two percent of that benchmark correctly.

The composite score

The right-most column, the Intelligence Index, is Artificial Analysis’s own overall score on its internal scale. Because it is partly derived from the six axes to its left, we show it for context only and never render it as a bar. Where Artificial Analysis flags a composite as an estimate rather than a measured result, we carry that mark through as an asterisk rather than dressing an estimate up as a measurement.
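Carrying the estimate mark through could look like the sketch below. The `Composite` type, its field names, the one-decimal formatting, and the sample values are all assumptions made for illustration; they are not the upstream data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Composite:
    value: float       # overall score on the source's own scale (illustrative)
    is_estimate: bool  # hypothetical flag: True when the source marks an estimate


def format_composite(c: Composite) -> str:
    """Render the composite as text only, appending '*' to estimates."""
    text = f"{c.value:.1f}"
    return text + "*" if c.is_estimate else text


print(format_composite(Composite(57.3, is_estimate=True)))   # 57.3*
print(format_composite(Composite(57.3, is_estimate=False)))  # 57.3
```

The function returns a string and nothing else, which mirrors the rule that the composite is shown as a figure and never as a bar.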

How the bars are scaled

Each mini-bar is scaled against a fixed reference of zero to one for its column, never against the strongest model in the table. This matters more than it sounds. A relative scale would make every refresh look like a dramatic reshuffle and would quietly flatter whichever models happen to be listed that day. A fixed scale means a bar that reads half-full carries the same meaning this month as it did last month.
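The difference between the two scaling choices is easy to see in a small sketch. The function names, the ten-character text bar, and the sample scores are hypothetical; only the fixed zero-to-one reference comes from the description above.

```python
from typing import List


def fixed_fill(score: float, lo: float = 0.0, hi: float = 1.0) -> float:
    """Fraction of the bar to fill against a fixed reference range."""
    fraction = (score - lo) / (hi - lo)
    return min(max(fraction, 0.0), 1.0)  # clamp to the drawable range


def relative_fill(score: float, column: List[float]) -> float:
    """The rejected alternative: scale against the best score in the table."""
    return score / max(column)


def render_bar(score: float, width: int = 10) -> str:
    """Draw a text bar using the fixed scale."""
    filled = round(fixed_fill(score) * width)
    return "#" * filled + "." * (width - filled)


# The same score of 0.50 on two days with different line-ups (made-up values).
monday = [0.50, 0.62]
friday = [0.50, 0.80]

print(render_bar(0.50))  # #####.....
print(relative_fill(0.50, monday), relative_fill(0.50, friday))
```

Under the relative scale the unchanged 0.50 fills about 81 percent of the bar on the first day and about 63 percent on the second, purely because a stronger model joined the table. Under the fixed scale it is half-full on both days.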

Missing and changing data

Not every model is evaluated on every benchmark. When a score does not exist we leave a gap. We never impute a value, average one in, or carry a neighbouring number across to fill the hole. Artificial Analysis also re-baselines its benchmark set from time to time, retiring or renaming tests as the field moves, and we watch for that. If a benchmark we depend on disappears, we keep showing the last known good values and flag the matrix with a visible notice that the set may be out of date, rather than silently dropping a column or guessing at a replacement.
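Both rules, leaving gaps and falling back to the last known good values, can be sketched as follows. The axis keys, the snapshot shape (model name mapped to per-axis scores), and the function names are assumptions for illustration, not the real feed format.

```python
from typing import Dict, Optional, Tuple

# Hypothetical keys for the six axes; the real feed may name them differently.
EXPECTED_AXES = ("gpqa", "hle", "scicode", "ifbench", "tau2", "terminal_bench_hard")

Snapshot = Dict[str, Dict[str, Optional[float]]]  # model -> axis -> score or None


def format_cell(score: Optional[float]) -> str:
    """A missing score renders as an empty cell; nothing is imputed."""
    return "" if score is None else f"{score:.2f}"


def choose_snapshot(last_good: Snapshot, fetched: Snapshot) -> Tuple[Snapshot, bool]:
    """Return (snapshot to display, stale flag).

    If any expected axis is absent from every model in the new fetch, the
    benchmark set has probably changed upstream, so keep the previous
    snapshot and raise the flag instead of dropping a column.
    """
    available = set()
    for scores in fetched.values():
        available.update(scores.keys())
    missing = [axis for axis in EXPECTED_AXES if axis not in available]
    if missing:
        return last_good, True
    return fetched, False


# Made-up data: the new fetch no longer carries the "hle" axis at all.
previous = {"model-a": {axis: 0.5 for axis in EXPECTED_AXES}}
incoming = {"model-a": {axis: 0.6 for axis in EXPECTED_AXES if axis != "hle"}}

shown, stale = choose_snapshot(previous, incoming)
print(stale)  # True
print(repr(format_cell(None)))  # ''
```

A per-model gap (a `None` for one model on one axis) does not trigger the stale flag here, because the axis still exists upstream; only an axis that vanishes for every model does.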

Corrections

If you believe a figure is wrong, tell us through the contact page. We would far rather fix a sourced number than defend it.