Every current frontier and efficient large language model — Claude, GPT-5, Gemini, Grok, DeepSeek, Qwen, Mistral and the best open-weight models — scored on the jobs people actually use them for: raw intelligence, generation speed and context length, then ranked by the metric that matters most — capability per dollar of tokens. The headline score is led by independent Artificial Analysis Intelligence Index benchmarks, not a vendor's launch-day chart. The smartest model isn't always the smart buy — a $0.18/M open-weight model can out-value a flagship many times over.
Start with the job — it drives every chart below
Each bubble is a model. Right = more capable, down/left = cheaper per million tokens — the top-left is the value sweet spot. Color = lab. Capability is an intelligence-led index (see Method); price is the blended cost per 1M tokens. The frontier models cluster top-right (smart but pricey); the open-weight models sit far left at a fraction of the cost. Pick a job above to filter the field.
Models (warm) linked to the capabilities they bring (cyan) — vision, audio and video input, tool use / agents, reasoning mode, 1M-token context, and open weights. Drag a node; hover to trace a model's stack. Built from each model's documented feature support.
Each model is scored on three measured axes, summed and normalized to 0–100. Capability is deliberately intelligence-dominant (~88%) so the smartest model leads — speed and context only break ties within an intelligence band, never across a real intelligence gap:
intelligence the Artificial Analysis Intelligence Index (their composite of reasoning, math, science & coding evals) — 61+ class=top … ~33 floor | speed measured output tokens/sec — 300+=top … ~44 floor | context max input window — 1M+=top … 131K floor.
Value = capability² ÷ blended price (per 1M tokens), so a model is rewarded for capability and punished for cost — which is why open-weight models like DeepSeek V4 and MiMo-V2.5 (~$0.18/M) top the value chart, and Gemini 3.1 Pro is the best-value frontier model. Measured beats claimed: the intelligence score is an independent benchmark index, not a vendor slide.
Honesty — accuracy is priority #1. Every number was hunted on 2026-06-03 from public leaderboards (Artificial Analysis for intelligence, knowledge, speed, latency & price; LLM-Stats and Vellum for coding, agentic and GPQA scores). Coding, agentic, knowledge (Omniscience), GPQA and latency are shown per model where independently measured but are NOT scored axes — their coverage is partial, and flooring an un-measured model would be unfair. Blended prices marked “est.” are a 3:1 input:output estimate where a published blended price wasn’t listed; Claude Opus 4.7’s current price wasn’t on the leaderboards, so its value is left blank rather than guessed. This is a fast-moving field — figures are as of the date above; unreleased/preview-only models (e.g. a Mythos-class preview) are excluded until broadly accessible. Nothing here is fabricated.