The LLM Accuracy Index

26 models · 129 task types · 11 modalities · 27,439 fields tested

Updated every 4h
FREE /recommend
curl "modelranking.ai/v1/benchmarks/recommend?task_type=invoice&cost_limit=0.05"

Returns the best provider for any task type. Free tier: 10 calls/day.
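The response schema for /recommend is not documented on this page, so the JSON below is a hypothetical example. A minimal sketch of picking the best model under a cost ceiling from such a response:

```python
import json

# Hypothetical /recommend response -- field names ("recommendations",
# "accuracy", "cost_per_call") are illustrative assumptions, not the
# documented schema.
sample = json.loads("""
{
  "task_type": "invoice",
  "recommendations": [
    {"model": "claude-haiku", "accuracy": 0.97, "cost_per_call": 0.004},
    {"model": "gpt-4o-mini",  "accuracy": 0.95, "cost_per_call": 0.003}
  ]
}
""")

def best_within_budget(resp: dict, cost_limit: float):
    """Highest-accuracy model whose per-call cost fits the budget, or None."""
    candidates = [r for r in resp["recommendations"]
                  if r["cost_per_call"] <= cost_limit]
    return max(candidates, key=lambda r: r["accuracy"]) if candidates else None

pick = best_within_budget(sample, cost_limit=0.05)
print(pick["model"])  # -> claude-haiku (0.97 accuracy at $0.004/call)
```

In practice you would pipe the live curl output into a script like this instead of inlining the JSON.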

$49/mo
/detailed + /compare
curl "modelranking.ai/v1/benchmarks/compare?models=claude-haiku,gpt-4o-mini" \
  -H "x-api-key: KEY"

How it works

129 Task Families

Document extraction (invoices, W-2s, 1099s, leases, medical bills), classification (sentiment, spam, intent), code generation (Python, JS, SQL, Bash), math & reasoning, translation, summarization, image extraction, and video understanding.

7 Difficulty Tiers

Each family is tested at seven difficulty tiers, including Clean, Noisy, Adversarial, and Degenerate, plus three additional levels. Synthetic test data is generated deterministically: no human-in-the-loop, no vibes, no prompt engineering.
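The actual generator isn't published here, but deterministic synthetic generation can be sketched as seeding an RNG from the (family, tier, case id) triple, so every run on every machine yields byte-identical test data. All field names and the noise model below are illustrative assumptions:

```python
import random

def make_invoice_case(family: str, tier: str, case_id: int) -> dict:
    """Deterministically generate one synthetic test case.

    Seeding the RNG with (family, tier, case_id) makes the output a pure
    function of those inputs -- reproducible everywhere. Illustrative
    sketch only, not the production generator.
    """
    rng = random.Random(f"{family}:{tier}:{case_id}")
    amount = round(rng.uniform(10, 5000), 2)
    text = f"Invoice total: ${amount}"
    if tier == "Noisy":
        # Inject OCR-style character noise -- still reproducibly,
        # because the coin flip comes from the seeded RNG.
        if rng.random() < 0.5:
            text = text.replace("o", "0")
        else:
            text = text.replace("t", "7")
    return {"input": text, "expected": {"total": amount}}

# Same seed inputs -> identical case, every time.
a = make_invoice_case("invoice", "Noisy", 7)
b = make_invoice_case("invoice", "Noisy", 7)
assert a == b
```

Because cases are pure functions of their seeds, a scored result can always be re-derived exactly, which is what makes the benchmark claims auditable.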

26 Models Scored

OpenAI (GPT-4o, 4.1, o4-mini), Anthropic (Claude Haiku/Sonnet/Opus), Google (Gemini Flash/Pro), Mistral, DeepSeek, Groq (Llama), xAI (Grok), Cohere. Every result is deterministic and reproducible.

Powered by Crystallized Intelligence (CI) — deterministic synthetic benchmarks, 100x cheaper than LLM-graded evals. See it in production at BookPull ↗