
LLM Comparison Table

All 16 models, compared on quality, price, speed, and four benchmarks.

Legend: * = estimated or thinking-mode score. CN ⚠ = Chinese jurisdiction. API $/1M = blended at 3:1 input:output. How we score →
| # | Model | Score | AA Index | Context | Speed t/s | API $/1M | Free? | Open? | Multi? | Trust | GPQA ⬦ | HLE ⬦ | SWE-bench ⬦ | MMLU-Pro ⬦ | Released |
|---|-------|-------|----------|---------|-----------|----------|-------|-------|--------|-------|--------|-------|-------------|------------|----------|
| 1 | Gemini 3.1 Pro (Google, top-pick) | 9.0/10 | 57 | 1M | 108.6 | $4.50 | ± ltd | — | — | US/EU | 94.1% | 44.4% | 80.6% | — | 2026-02-19 |
| 2 | Gemini 3 Pro (Google, best-value) | 7.9/10 | 48.44 | 1M | 55* | $4.50 | ✓ full | — | — | US/EU | 91.9% | 37.5% | 76.2% | 90.1% | 2025-11-18 |
| 3 | GPT-5.2 (OpenAI, top-pick) | 7.4/10 | 46.58 | 400K | 65* | $4.81 | ± ltd | — | — | US/EU | 90.3% | 28.0%* | 80.0% | 87.5% | 2025-12-11 |
| 4 | Grok 4.1 (xAI) | 7.4/10 | 41.43 | 2M | 90* | $6.00 | ± ltd | — | — | US/EU | 88.0%* | 45.0%* | 75.0%* | 86.6% | 2025-11-17 |
| 5 | Claude Sonnet 4.6 (Anthropic) | 7.1/10 | 44.33 | 200K | 85* | $6.00 | ± ltd | — | — | US/EU | 89.9%* | 33.2% | 79.6% | 89.3%* | 2026-02-17 |
| 6 | Llama 4 Scout (Meta, open-source) | 6.9/10 | 38.5* | 10M | 180 | $0.11 | — | ✓ | — | US/EU | — | — | — | — | 2025-04-05 |
| 7 | Gemini 3 Flash (Google, fastest) | 6.8/10 | 35 | 1M | 170 | $1.13 | ✓ full | — | — | US/EU | 90.4% | — | 78.0%* | 88.6% | 2025-12-17 |
| 8 | Claude Opus 4.6 (Anthropic, top-pick) | 6.6/10 | 46 | 200K | 67 | $10.00 | — | — | — | US/EU | 91.3% | 40.0% | 80.8% | — | 2026-02-05 |
| 9 | GPT-5 mini (OpenAI, best-value) | 6.5/10 | 39 | 400K | 73 | $0.69 | ± ltd | — | — | US/EU | 74.9%* | — | — | — | 2025-08-07 |
| 10 | Claude Haiku 4.5 (Anthropic, best-value) | 5.6/10 | 31 | 200K | 108.8 | $2.00 | ± ltd | — | — | US/EU | — | — | — | — | 2025-10-15 |
| 11 | DeepSeek V3.2 (DeepSeek) | 5.5/10 | 41.61 | 128K | 45* | $0.48 | ✓ full | — | — | CN ⚠ | 82.4% | — | — | 85.0% | 2025-12-01 |
| 12 | GPT OSS 120B (OpenAI, open-source) | 5.4/10 | 33 | 131K | 336 | $0.26 | ± ltd | ✓ | — | US/EU | 80.9%* | — | — | — | 2025-08-05 |
| 13 | Kimi K2 (Moonshot AI, open-source) | 5.0/10 | 31 | 262K | 62.3 | $0.77 | ± ltd | ✓ | — | CN ⚠ | 84.5%* | — | — | 84.6%* | 2025-09-05 |
| 14 | Mistral Large 3 (Mistral) | 4.4/10 | 23 | 256K | 56 | $0.75 | ± ltd | — | — | US/EU | 85.5%* | — | — | — | 2025-12-02 |
| 15 | Llama 4 Maverick (Meta, open-source) | 4.4/10 | 18 | 1M | 125 | $0.44 | — | ✓ | — | US/EU | 69.8% | — | — | 80.5% | 2025-04-05 |
| 16 | Qwen 3 235B (Alibaba, open-source) | 3.3/10 | 17 | 262K | 40.7 | $0.37 | ± ltd | ✓ | — | CN ⚠ | 81.1%* | — | — | — | 2025-04-29 |

⬦ = benchmark column. * = estimated or thinking/reasoning-mode score; treat as approximate. | GPQA Diamond: 198 PhD-level science questions (random guessing = 25%). HLE: Humanity's Last Exam, the hardest public benchmark. SWE-bench: real GitHub issues resolved. MMLU-Pro: 12K graduate-level questions, 10 answer choices each.

Score = composite quality rating 0–10. AA Index = Artificial Analysis Intelligence Index v4.0. API $/1M = blended at 3:1 input:output. Full methodology →
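The blended API price can be reproduced from a model's per-token list prices. A minimal sketch of the 3:1 weighting, using illustrative input/output prices of $2 and $12 per 1M tokens (the table itself lists only the blended figure):

```python
def blended_price(input_per_m: float, output_per_m: float,
                  ratio_in: int = 3, ratio_out: int = 1) -> float:
    """Blend per-1M-token prices at a given input:output usage ratio."""
    total = ratio_in + ratio_out
    return (ratio_in * input_per_m + ratio_out * output_per_m) / total

# Illustrative prices, not taken from the table: $2/1M input, $12/1M output.
# 3:1 blend = (3 * 2 + 1 * 12) / 4
print(round(blended_price(2.00, 12.00), 2))  # → 4.5
```

A 3:1 ratio models typical chat workloads, where prompts (input) dominate completions (output); a different ratio would shift every $/1M figure in the table.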

Last verified February 2026. Benchmark scores sourced from Artificial Analysis, official model announcements, Vellum, and technical reports.