No single model wins everything. → Browse 192 task benchmarks
The LLM Accuracy Index
We test 26 AI models on 192 real tasks — extraction, classification, code generation, math, translation — using synthetic documents with known correct answers. Every score is deterministic. No human judgment. The best model depends on the task.
26 models · 192 tasks · 86,736 fields · 11 modalities
/v1/benchmarks/recommend?task_type=invoice&cost_limit=0.05
Best provider for any task type. 10 calls/day free. $49/mo for /detailed and /compare.
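For illustration, here is how a call to the recommend endpoint might look from Python. The host name, auth header, and response shape are assumptions for the sketch, not documented API details.

```python
# A minimal sketch of calling the recommend endpoint with the requests
# library. The base URL and API-key header are assumptions; substitute
# whatever your dashboard provides.
import requests

resp = requests.get(
    "https://api.example.com/v1/benchmarks/recommend",  # hypothetical host
    params={"task_type": "invoice", "cost_limit": 0.05},
    headers={"Authorization": "Bearer YOUR_API_KEY"},   # placeholder key
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # e.g. the best model for invoice tasks under $0.05/call
```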
About this data
Synthetic ground truth
We generate the test documents, so we know the exact right answer before any model sees them. No human labeling. No ambiguity. No training data contamination. Fresh test cases generated every 3 seconds.
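As a sketch of the idea (not our actual generator), synthetic ground truth means the document and its answer key are produced by the same code path, so grading needs no human labels:

```python
# Illustrative only: generate a document together with its answer key.
# Field names and the template are assumptions for this sketch.
import random

def make_invoice():
    truth = {
        "invoice_number": f"INV-{random.randint(1000, 9999)}",
        "date": "2024-05-17",
        "total": round(random.uniform(10, 500), 2),
    }
    document = (
        f"Invoice {truth['invoice_number']}\n"
        f"Date: {truth['date']}\n"
        f"Total due: ${truth['total']:.2f}\n"
    )
    return document, truth  # the model sees `document`; we keep `truth`
```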
Field-by-field scoring
We don't ask "is this response good?" We check each field individually. Did the model get the invoice number right? The date? The total? Accuracy = percentage of fields extracted correctly. Deterministic, reproducible, binary.
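A minimal sketch of this scoring rule, assuming exact-match comparison per field (real comparison rules for dates, currencies, etc. may be more forgiving):

```python
# Accuracy = fraction of fields where the extracted value exactly
# matches ground truth. Binary per field, deterministic, reproducible.
def field_accuracy(truth: dict, extracted: dict) -> float:
    correct = sum(1 for k, v in truth.items() if extracted.get(k) == v)
    return correct / len(truth)

truth = {"invoice_number": "INV-4821", "date": "2024-05-17", "total": 412.50}
extracted = {"invoice_number": "INV-4821", "date": "2024-05-17", "total": 412.05}
print(field_accuracy(truth, extracted))  # 0.666..., two of three fields correct
```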
Task-specific rankings
The overall leaderboard is an average. The real insight is per-task. 90% of the 183 task types have a different #1 model than the overall leader. Click any task to see which model is actually best for that specific job.
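To see why the average and the per-task view diverge, consider a toy score matrix (made-up numbers, not leaderboard data): the model with the best mean can lead few or none of the individual tasks.

```python
# Illustrative scores: model_c has the best average but tops only one task.
scores = {
    "model_a": {"invoice": 0.92, "code": 0.70, "math": 0.88},
    "model_b": {"invoice": 0.85, "code": 0.95, "math": 0.80},
    "model_c": {"invoice": 0.88, "code": 0.86, "math": 0.90},
}

overall = max(scores, key=lambda m: sum(scores[m].values()) / len(scores[m]))
per_task = {
    task: max(scores, key=lambda m: scores[m][task])
    for task in next(iter(scores.values()))
}
print(overall)   # model_c wins on average...
print(per_task)  # ...but leads only math; model_a and model_b each lead a task
```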