LLM Comparison Table
All 16 models, ranked by composite score (highest first).
Legend: * = estimated or thinking-mode score · CN ⚠ = Chinese jurisdiction · API $/1M = blended price at a 3:1 input:output token ratio.
| # | Model | Score | AA Index | Context | Speed (tok/s) | API $/1M | Free tier | Open weights | Multimodal | Trust (jurisdiction) | GPQA ⬦ | HLE ⬦ | SWE-bench ⬦ | MMLU-Pro ⬦ | Released |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro (Google) · top pick | 9.0/10 | 57 | 1M | 108.6 | $4.50 | limited | ✗ | ✓ | US/EU | 94.1% | 44.4% | 80.6% | — | 2026-02-19 |
| 2 | Gemini 3 Pro (Google) · best value | 7.9/10 | 48.44 | 1M | 55* | $4.50 | full | ✗ | ✓ | US/EU | 91.9% | 37.5% | 76.2% | 90.1% | 2025-11-18 |
| 3 | GPT-5.2 (OpenAI) · top pick | 7.4/10 | 46.58 | 400K | 65* | $4.81 | limited | ✗ | ✓ | US/EU | 90.3% | 28.0%* | 80.0% | 87.5% | 2025-12-11 |
| 4 | Grok 4.1 (xAI) | 7.4/10 | 41.43 | 2M | 90* | $6.00 | limited | ✗ | ✓ | US/EU | 88.0%* | 45.0%* | 75.0%* | 86.6% | 2025-11-17 |
| 5 | Claude Sonnet 4.6 (Anthropic) | 7.1/10 | 44.33 | 200K | 85* | $6.00 | limited | ✗ | ✓ | US/EU | 89.9%* | 33.2% | 79.6% | 89.3%* | 2026-02-17 |
| 6 | Llama 4 Scout (Meta) · open source | 6.9/10 | 38.5* | 10M | 180 | $0.11 | — | ✓ | ✓ | US/EU | — | — | — | — | 2025-04-05 |
| 7 | Gemini 3 Flash (Google) · fastest | 6.8/10 | 35 | 1M | 170 | $1.13 | full | ✗ | ✓ | US/EU | 90.4% | — | 78.0%* | 88.6% | 2025-12-17 |
| 8 | Claude Opus 4.6 (Anthropic) · top pick | 6.6/10 | 46 | 200K | 67 | $10.00 | — | ✗ | ✓ | US/EU | 91.3% | 40.0% | 80.8% | — | 2026-02-05 |
| 9 | GPT-5 mini (OpenAI) · best value | 6.5/10 | 39 | 400K | 73 | $0.69 | limited | ✗ | ✓ | US/EU | — | — | 74.9%* | — | 2025-08-07 |
| 10 | Claude Haiku 4.5 (Anthropic) · best value | 5.6/10 | 31 | 200K | 108.8 | $2.00 | limited | ✗ | ✓ | US/EU | — | — | — | — | 2025-10-15 |
| 11 | DeepSeek V3.2 (DeepSeek) | 5.5/10 | 41.61 | 128K | 45* | $0.48 | full | ✓ | ✗ | CN ⚠ | 82.4% | — | — | 85.0% | 2025-12-01 |
| 12 | GPT OSS 120B (OpenAI) · open source | 5.4/10 | 33 | 131K | 336 | $0.26 | limited | ✓ | ✗ | US/EU | 80.9%* | — | — | — | 2025-08-05 |
| 13 | Kimi K2 (Moonshot AI) · open source | 5.0/10 | 31 | 262K | 62.3 | $0.77 | limited | ✓ | ✗ | CN ⚠ | 84.5%* | — | — | 84.6%* | 2025-09-05 |
| 14 | Mistral Large 3 (Mistral) | 4.4/10 | 23 | 256K | 56 | $0.75 | limited | ✓ | ✓ | US/EU | — | — | — | 85.5%* | 2025-12-02 |
| 15 | Llama 4 Maverick (Meta) · open source | 4.4/10 | 18 | 1M | 125 | $0.44 | — | ✓ | ✓ | US/EU | 69.8% | — | — | 80.5% | 2025-04-05 |
| 16 | Qwen 3 235B (Alibaba) · open source | 3.3/10 | 17 | 262K | 40.7 | $0.37 | limited | ✓ | ✗ | CN ⚠ | 81.1%* | — | — | — | 2025-04-29 |
⬦ = benchmark column. * = estimated or thinking/reasoning-mode score; treat as approximate. GPQA Diamond: 198 PhD-level science questions (random guessing scores 25%). HLE: Humanity's Last Exam, the hardest public benchmark. SWE-bench: percentage of real GitHub issues resolved. MMLU-Pro: 12K graduate-level questions with 10 answer choices each.
Score = composite quality rating on a 0–10 scale. AA Index = Artificial Analysis Intelligence Index v4.0. API $/1M = price per million tokens, blended at a 3:1 input:output ratio.
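For readers reproducing the pricing column, here is a minimal sketch of the blend, assuming the 3:1 ratio means a simple weighted average of per-million input and output prices (the common reading of "blended", though the methodology page is authoritative). The prices in the example are hypothetical, not taken from the table.

```python
def blended_price(input_per_1m: float, output_per_1m: float) -> float:
    """Blended API cost per 1M tokens at a 3:1 input:output ratio.

    Assumes "blended at 3:1" means three input tokens for every
    output token, i.e. a (3 * input + 1 * output) / 4 weighted average.
    """
    return (3 * input_per_1m + output_per_1m) / 4


# Hypothetical prices: $3.00/1M input and $9.00/1M output blend to $4.50/1M.
print(blended_price(3.00, 9.00))  # 4.5
```

Note that workloads skewed toward output (for example, long reasoning traces) will usually cost more in practice than the 3:1 blend suggests, since output tokens are typically priced higher than input tokens.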
Last verified February 2026. Benchmark scores sourced from Artificial Analysis, official model announcements, Vellum, and technical reports.