All LLM Reviews
11 models reviewed and rated. Sorted by overall score. Updated monthly.
Google's frontier model and the best value at the top tier. At $2/$12 per 1M tokens via the API, Gemini 3 Pro undercuts both Claude and GPT-5.2 while matching them on most benchmarks. The 1M token context window and Google Workspace integration are hard to beat.
OpenAI's current flagship. GPT-5.2 significantly outpaces GPT-4o: a 400K token context window, a hallucination rate down to 6.2%, and perfect scores on the AIME 2025 math benchmark. It's the model most ChatGPT users are now running.
Anthropic's mid-tier model and the practical daily-driver recommendation. Sonnet 4.6 sits just below Opus in raw intelligence but costs 80% less. It's the best model for writing, analysis, and long-document work for anyone who isn't running enterprise-scale inference.
xAI's Grok 4.1 has two things nobody else offers: real-time access to X (Twitter) data and a 2 million token context window. Access comes bundled with X Premium — so if you're already paying for X, Grok is effectively included.
Meta's open-source flagship has a 10 million token context window, the largest of any model available by a wide margin. The weights are free to download under Meta's Llama 4 license, but running the model yourself takes serious compute. Via Groq it's among the cheapest hosted options at ~$0.11/1M tokens.
Anthropic's most powerful model and the top-ranked non-reasoning LLM on the Artificial Analysis Intelligence Index as of February 2026 (AA Index 46). Opus 4.6 is the model you reach for when quality matters more than cost: complex multi-step analysis, high-stakes creative work, and agentic workflows where a small output quality difference has real downstream consequences. The price — $5/$25 per 1M tokens — reflects that positioning. Unrestricted consumer access requires the Claude Max plan ($100/month).
OpenAI's budget reasoning model and one of the most interesting value plays in the current field. GPT-5 mini runs in medium-effort reasoning mode by default and scores 39 on the Artificial Analysis Intelligence Index — higher than several premium-priced non-reasoning models — at $0.25/$2.00 per 1M tokens. That combination makes it smarter per dollar than most alternatives in its price tier. The 400K context window and multimodal input support round out a genuinely capable package for developers who need better-than-baseline quality without flagship pricing.
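The "smarter per dollar" claim can be made concrete with simple division. A minimal sketch, using only the AA Index scores and per-token prices quoted in these reviews; the 3:1 input-to-output token mix used for blending is an assumption, not a provider-published figure:

```python
# Value comparison: AA Index points per blended dollar.
# Scores and list prices come from the reviews on this page; the 3:1
# input-to-output token mix is an assumption.
models = {
    # name: (AA Index score, $/1M input tokens, $/1M output tokens)
    "GPT-5 mini":      (39, 0.25, 2.00),
    "Gemini 3 Flash":  (35, 0.50, 3.00),
    "Claude Opus 4.6": (46, 5.00, 25.00),
}

for name, (score, p_in, p_out) in models.items():
    blended = (3 * p_in + p_out) / 4  # weighted $/1M tokens at a 3:1 mix
    print(f"{name}: {score / blended:.1f} index points per blended dollar")
```

By this rough measure GPT-5 mini delivers roughly ten times the intelligence per dollar of Opus 4.6, which is exactly the trade the two models' positioning implies: one optimizes for value, the other for peak quality.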
Google's speed-optimized model that closes surprising ground on intelligence. Released December 2025, Gemini 3 Flash scores 35 on the Artificial Analysis Intelligence Index — higher than several models that cost five to ten times more per token — while running at 170 tokens per second. At $0.50/$3.00 per 1M, it's genuinely cheap for high-volume API use. The 1M token context window and native video/audio/image input make it the practical go-to for multimodal pipelines that need throughput without paying Gemini 3 Pro prices.
DeepSeek's latest model continues to shock with its price-to-performance ratio. V3.2 introduces 'Fine-Grained Sparse Attention' for 50% better compute efficiency. Input costs drop to $0.07/1M tokens with cache hits. The web interface at chat.deepseek.com appears to be free with no hard usage cap.
Mistral's December 2025 flagship and the most commercially permissive large model in this comparison. Mistral Large 3 is released under Apache 2.0 — genuinely open for commercial use without royalties or usage restrictions. At 675B total parameters with 41B active per token (mixture-of-experts), it scores 23 on the Artificial Analysis Intelligence Index at $0.50/$1.50 per 1M tokens. For enterprise teams that need open-weight licensing terms, the math is straightforward: comparable capability to other open-weight models, completely unrestricted commercial use, and a 256K context window that covers most document workflows.
Meta's midweight open-source model in the Llama 4 family, larger than Scout, with 402B total parameters (17B active via mixture-of-experts), a 1M token context window, and notably fast inference at 124.6 t/s. The Artificial Analysis Intelligence Index scores it at 18, below frontier models, but Maverick is not designed to compete on raw reasoning. It exists for workloads where open weights, massive context, and low API cost matter more than cutting-edge benchmark performance. At $0.44/1M blended via Together AI, it's one of the cheapest options for large-context production API use.
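A "blended" rate like Maverick's $0.44/1M is a weighted average of separate input and output token prices. A minimal sketch of that arithmetic, assuming a 3:1 input-to-output token mix (the ratio is an assumption; hosted providers publish their own blend):

```python
def blended_price(input_per_m: float, output_per_m: float,
                  mix: float = 3.0) -> float:
    """$/1M tokens, averaging input and output prices at a mix:1
    input-to-output token ratio."""
    return (mix * input_per_m + output_per_m) / (mix + 1)

# Example using GPT-5 mini's quoted rates of $0.25 in / $2.00 out:
print(blended_price(0.25, 2.00))  # 0.6875
```

The same formula explains why a model with cheap input but much pricier output still lands at a low blended rate: input-heavy workloads dominate the average.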
Ratings based on real-world task testing and independent benchmark data. Pricing verified from provider docs — check official sites for current rates.