LLM Comparison Table
All 16 models, ranked by composite score (highest first).
Legend: * = estimated or thinking-mode score · CN ⚠ = Chinese jurisdiction · API $/1M = blended price at a 3:1 input:output token ratio.
| # | Model | Score | AA Index | Context | Speed (tok/s) | API $/1M | Free tier | Open weights | Multimodal | Trust (jurisdiction) | GPQA ⬦ | HLE ⬦ | SWE-bench ⬦ | MMLU-Pro ⬦ | Released |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro (Google) · top pick | 9.0/10 | 57 | 1M | 108.6 | $4.50 | limited | ✗ | ✓ | US/EU | 94.1% | 44.4% | 80.6% | — | 2026-02-19 |
| 2 | Gemini 3 Pro (Google) · best value | 7.9/10 | 48.44 | 1M | 55* | $4.50 | full | ✗ | ✓ | US/EU | 91.9% | 37.5% | 76.2% | 90.1% | 2025-11-18 |
| 3 | GPT-5.2 (OpenAI) · top pick | 7.4/10 | 46.58 | 400K | 65* | $4.81 | limited | ✗ | ✓ | US/EU | 90.3% | 28.0%* | 80.0% | 87.5% | 2025-12-11 |
| 4 | Grok 4.1 (xAI) | 7.4/10 | 41.43 | 2M | 90* | $6.00 | limited | ✗ | ✓ | US/EU | 88.0%* | 45.0%* | 75.0%* | 86.6% | 2025-11-17 |
| 5 | Claude Sonnet 4.6 (Anthropic) | 7.1/10 | 44.33 | 200K | 85* | $6.00 | limited | ✗ | ✓ | US/EU | 89.9%* | 33.2% | 79.6% | 89.3%* | 2026-02-17 |
| 6 | Llama 4 Scout (Meta) · open source | 6.9/10 | 38.5* | 10M | 180 | $0.11 | — | ✓ | ✓ | US/EU | — | — | — | — | 2025-04-05 |
| 7 | Gemini 3 Flash (Google) · fastest | 6.8/10 | 35 | 1M | 170 | $1.13 | full | ✗ | ✓ | US/EU | 90.4% | — | 78.0%* | 88.6% | 2025-12-17 |
| 8 | Claude Opus 4.6 (Anthropic) · top pick | 6.6/10 | 46 | 200K | 67 | $10.00 | — | ✗ | ✓ | US/EU | 91.3% | 40.0% | 80.8% | — | 2026-02-05 |
| 9 | GPT-5 mini (OpenAI) · best value | 6.5/10 | 39 | 400K | 73 | $0.69 | limited | ✗ | ✓ | US/EU | — | — | 74.9%* | — | 2025-08-07 |
| 10 | Claude Haiku 4.5 (Anthropic) · best value | 5.6/10 | 31 | 200K | 108.8 | $2.00 | limited | ✗ | ✓ | US/EU | — | — | — | — | 2025-10-15 |
| 11 | DeepSeek V3.2 (DeepSeek) | 5.5/10 | 41.61 | 128K | 45* | $0.48 | full | ✓ | ✗ | CN ⚠ | 82.4% | — | — | 85.0% | 2025-12-01 |
| 12 | GPT OSS 120B (OpenAI) · open source | 5.4/10 | 33 | 131K | 336 | $0.26 | limited | ✓ | ✗ | US/EU | 80.9%* | — | — | — | 2025-08-05 |
| 13 | Kimi K2 (Moonshot AI) · open source | 5.0/10 | 31 | 262K | 62.3 | $0.77 | limited | ✓ | ✗ | CN ⚠ | 84.5%* | — | — | 84.6%* | 2025-09-05 |
| 14 | Mistral Large 3 (Mistral) | 4.4/10 | 23 | 256K | 56 | $0.75 | limited | ✓ | ✓ | US/EU | — | — | — | 85.5%* | 2025-12-02 |
| 15 | Llama 4 Maverick (Meta) · open source | 4.4/10 | 18 | 1M | 125 | $0.44 | — | ✓ | ✓ | US/EU | 69.8% | — | — | 80.5% | 2025-04-05 |
| 16 | Qwen 3 235B (Alibaba) · open source | 3.3/10 | 17 | 262K | 40.7 | $0.37 | limited | ✓ | ✗ | CN ⚠ | 81.1%* | — | — | — | 2025-04-29 |
⬦ = benchmark column. * = estimated or thinking/reasoning-mode score; treat as approximate. GPQA Diamond: 198 PhD-level science questions (random guessing scores 25%). HLE: Humanity's Last Exam, the hardest public benchmark. SWE-bench: percentage of real GitHub issues resolved. MMLU-Pro: 12K graduate-level questions with 10 answer choices each.
Score = composite quality rating on a 0–10 scale. AA Index = Artificial Analysis Intelligence Index v4.0. API $/1M = price per million tokens, blended at a 3:1 input:output ratio.
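For readers reproducing the pricing column, here is a minimal sketch of the blend, assuming the 3:1 ratio means a simple weighted average of per-million input and output prices (the common reading of "blended", though the methodology page is authoritative). The prices in the example are hypothetical, not taken from the table.

```python
def blended_price(input_per_1m: float, output_per_1m: float) -> float:
    """Blended API cost per 1M tokens at a 3:1 input:output ratio.

    Assumes "blended at 3:1" means three input tokens for every
    output token, i.e. a (3 * input + 1 * output) / 4 weighted average.
    """
    return (3 * input_per_1m + output_per_1m) / 4


# Hypothetical prices: $3.00/1M input and $9.00/1M output blend to $4.50/1M.
print(blended_price(3.00, 9.00))  # 4.5
```

Note that workloads skewed toward output (for example, long reasoning traces) will usually cost more in practice than the 3:1 blend suggests, since output tokens are typically priced higher than input tokens.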
Last verified February 2026. Benchmark scores sourced from Artificial Analysis, official model announcements, Vellum, and technical reports.