Best Multimodal AI Models
Models that accept both text and images as input, ranked by overall quality score. Useful for document analysis, screenshot debugging, visual Q&A, and mixed-media workflows. Text-only models are excluded. Rankings use the same composite quality score as the overall rankings; price is not a factor.
top multimodal model
Gemini 3.1 Pro
Google · Quality 9.0/10 · AA Index 57
Google's reasoning-optimized flagship, released February 19, 2026, and currently ranked #1 of 114 models on the Artificial Analysis Intelligence Index (score: 57). Gemini 3.1 Pro is a direct upgrade to Gemini 3 Pro — same 1M token context window and same $2/$12 pricing — but with dramatically improved reasoning. Its ARC-AGI-2 abstract reasoning score more than doubled from 31.1% to 77.1%, and it nearly doubled its APEX-Agents agentic task score (18.4% → 33.5%). It leads on scientific knowledge (GPQA Diamond 94.3%), competitive coding (LiveCodeBench Pro Elo 2887), and multi-step agentic search (BrowseComp 85.9%). A dedicated custom-tools API endpoint is available for agentic pipeline use. Currently in preview — generally available soon.
Google's frontier model and the best value at the top tier. At $2/$12 per 1M tokens via the API, Gemini 3 Pro undercuts both Claude and GPT-5.2 while matching them on most benchmarks. The 1M token context window and Google Workspace integration are hard to beat.
OpenAI's current flagship. GPT-5.2 significantly outpaces GPT-4o — it has a 400K token context window, a hallucination rate down to 6.2%, and a perfect score on the AIME 2025 math benchmark. It's the model most people using ChatGPT are now running on.
xAI's Grok 4.1 has two things nobody else offers: real-time access to X (Twitter) data and a 2 million token context window. Access comes bundled with X Premium — so if you're already paying for X, Grok is effectively included.
Anthropic's mid-tier model and the practical daily-driver recommendation. Sonnet 4.6 sits just below Opus in raw intelligence but costs 80% less. It's the best model for writing, analysis, and long-document work for anyone who isn't running enterprise-scale inference.
Meta's open-weight Llama 4 Scout has a 10 million token context window — by a wide margin the largest of any model available. The weights are free to download under Meta's Llama 4 license, but running it costs compute. Via Groq it's among the cheapest options at ~$0.11/1M tokens.
Google's speed-optimized model that closes surprising ground on intelligence. Released December 2025, Gemini 3 Flash scores 35 on the Artificial Analysis Intelligence Index — higher than several models that cost five to ten times more per token — while running at 170 tokens per second. At $0.50/$3.00 per 1M, it's genuinely cheap for high-volume API use. The 1M token context window and native video/audio/image input make it the practical go-to for multimodal pipelines that need throughput without paying Gemini 3 Pro prices.
Anthropic's most powerful model and the top-ranked non-reasoning LLM on the Artificial Analysis Intelligence Index as of February 2026 (AA Index 46). Opus 4.6 is the model you reach for when quality matters more than cost: complex multi-step analysis, high-stakes creative work, and agentic workflows where a small output quality difference has real downstream consequences. The price — $5/$25 per 1M tokens — reflects that positioning. Unrestricted consumer access requires the Claude Max plan ($100/month).
OpenAI's budget reasoning model and one of the most interesting value plays in the current field. GPT-5 mini runs in medium-effort reasoning mode by default and scores 39 on the Artificial Analysis Intelligence Index — higher than several premium-priced non-reasoning models — at $0.25/$2.00 per 1M tokens. That combination makes it smarter per dollar than most alternatives in its price tier. The 400K context window and multimodal input support round out a genuinely capable package for developers who need better-than-baseline quality without flagship pricing.
Anthropic's fastest and most affordable model in the Claude 4 generation, released October 2025. Claude Haiku 4.5 runs at 108.8 tokens/second — fast enough for real-time streaming — at $1/$5 per 1M tokens. Despite the low price, it scores an AA Intelligence Index of 31, placing it #13 of 60 proprietary models. It outperforms Claude Sonnet 4 on computer-use benchmarks (50.7% vs 42.2%) at a third of the cost. Supports extended thinking mode (thinking tokens billed at the $5/1M output rate), image input, and the full 200K context window shared across the Claude 4 generation.
Mistral's December 2025 flagship and the most commercially permissive large model in this comparison. Mistral Large 3 is released under Apache 2.0 — genuinely open for commercial use without royalties or usage restrictions. At 675B total parameters with 41B active per token (mixture-of-experts), it scores 23 on the Artificial Analysis Intelligence Index at $0.50/$1.50 per 1M tokens. For enterprise teams that need open-weight licensing terms, the math is straightforward: comparable capability to other open-weight models, completely unrestricted commercial use, and a 256K context window that covers most document workflows.
Meta's midweight open-source model in the Llama 4 family — larger than Scout (402B total parameters, 17B active via mixture-of-experts) with a 1M token context window and notably fast inference at 124.6 t/s. The Artificial Analysis Intelligence Index scores it at 18, below frontier models, but Maverick is not designed to compete on raw reasoning. It exists for workloads where open weights, massive context, and low API cost matter more than cutting-edge benchmark performance. At $0.44/1M blended via Together AI, it's one of the cheapest options for large-context production API use.
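The $X/$Y prices quoted throughout this list are separate input and output rates, while a "blended" figure like Maverick's $0.44/1M is a single weighted average of the two. A minimal sketch of that arithmetic, using prices quoted above and assuming a 3:1 input-to-output token ratio (an illustrative convention, not a figure from these rankings):

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_share: float = 0.75) -> float:
    """Weighted average of input/output rates, in $ per 1M tokens.

    input_share is the assumed fraction of traffic that is input
    tokens (0.75 = a 3:1 input:output ratio).
    """
    return input_per_m * input_share + output_per_m * (1 - input_share)

# Input/output rates quoted in the rankings above ($ per 1M tokens):
models = {
    "Gemini 3 Pro":    (2.00, 12.00),
    "Gemini 3 Flash":  (0.50, 3.00),
    "GPT-5 mini":      (0.25, 2.00),
    "Claude Opus 4.6": (5.00, 25.00),
}

for name, (inp, out) in models.items():
    print(f"{name}: ${blended_price(inp, out):.2f}/1M blended")
```

Note how the blend shifts with workload: summarization jobs (huge inputs, short outputs) sit near the input rate, while generation-heavy jobs drift toward the much higher output rate, so the same model can have very different effective costs across pipelines.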
What counts as multimodal?
These models accept at least image + text input. Several also support audio, video frames, or document uploads. “Multimodal” does not mean they generate images — for image generation see Image Generators. Want to browse without rank order? Browse all multimodal models →
Last updated February 2026. Intelligence scores from Artificial Analysis. See how we rate for full methodology.