The LLM Accuracy Index
26 models · 129 task types · 11 modalities · 27,439 fields tested
Cost per Million Tokens vs Accuracy
[Chart: one bubble per model; bubble size = fields tested; top-left = best value.]

/recommend
curl "modelranking.ai/v1/benchmarks/recommend?task_type=invoice&cost_limit=0.05"
Best provider for any task type. 10 calls/day.
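A call like the one above returns JSON. As a rough sketch of the shape it might take (the field names and numbers below are illustrative assumptions, not documented output):

{
  "task_type": "invoice",
  "recommended_model": "claude-haiku",
  "accuracy": 0.94,
  "cost_per_million_tokens": 0.04
}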
/detailed + /compare
curl "modelranking.ai/v1/benchmarks/compare?models=claude-haiku,gpt-4o-mini" \
  -H "x-api-key: KEY"
Per-model detail and head-to-head comparisons. Requires an API key.
How it works

129 Task Families
Document extraction (invoices, W-2s, 1099s, leases, medical bills), classification (sentiment, spam, intent), code generation (Python, JS, SQL, Bash), math & reasoning, translation, summarization, image extraction, and video understanding.
7 Difficulty Tiers
Each family is tested at Clean, Noisy, Adversarial, and Degenerate difficulty levels (plus 3 additional tiers). Synthetic test data is generated deterministically: no human-in-the-loop, no vibes, no prompt engineering.
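To make "deterministic" concrete, here is a toy sketch of seeded noise injection for a Noisy-tier input. It illustrates the idea only, not the actual CI generator; with the same seed, the corrupted output is identical on every run:

# Toy Noisy-tier generator: seed the PRNG, corrupt ~15% of characters.
# Same seed => byte-identical output across runs (gawk/mawk).
awk -v seed=42 'BEGIN {
  srand(seed)
  s = "Invoice Total: $1,204.50"
  for (i = 1; i <= length(s); i++) {
    c = substr(s, i, 1)
    if (rand() < 0.15) c = "#"
    printf "%s", c
  }
  print ""
}'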
26 Models Scored
OpenAI (GPT-4o, 4.1, o4-mini), Anthropic (Claude Haiku/Sonnet/Opus), Google (Gemini Flash/Pro), Mistral, DeepSeek, Groq (Llama), xAI (Grok), Cohere. Every result is deterministic and reproducible.
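"Deterministic and reproducible" follows from the grading step: answers are scored by exact field comparison, not by a judge model. A minimal sketch of that idea, assuming expected.json and extracted.json hold flat {"field": "value"} objects (the filenames and layout are assumptions):

# Toy scorer: fraction of fields where the model's extraction exactly
# matches ground truth. No judge model involved.
jq -n --slurpfile e expected.json --slurpfile x extracted.json '
  ($e[0] | to_entries) as $fields
  | ([ $fields[] | select(.value == $x[0][.key]) ] | length) / ($fields | length)
'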
Powered by Crystallized Intelligence (CI): deterministic synthetic benchmarks, 100x cheaper than LLM-graded evals. See it in production at BookPull.