The LLM Accuracy Index
26 models · 129 task types · 11 modalities · 27,439 fields tested
Cost per Million Tokens vs Accuracy
[Chart: one bubble per model; bubble size = fields tested; top-left = best value.]

/recommend
curl "modelranking.ai/v1/benchmarks/recommend?task_type=invoice&cost_limit=0.05"
Best provider for any task type. 10 calls/day.
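A call like the one above returns JSON. As a rough sketch of the shape it might take (the field names and numbers below are illustrative assumptions, not documented output):

{
  "task_type": "invoice",
  "recommended_model": "claude-haiku",
  "accuracy": 0.94,
  "cost_per_million_tokens": 0.04
}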
/detailed + /compare
curl "modelranking.ai/v1/benchmarks/compare?models=claude-haiku,gpt-4o-mini" \
  -H "x-api-key: KEY"
Per-model detail and head-to-head comparisons. Requires an API key.
How it works

129 Task Families
Document extraction (invoices, W-2s, 1099s, leases, medical bills), classification (sentiment, spam, intent), code generation (Python, JS, SQL, Bash), math & reasoning, translation, summarization, image extraction, and video understanding.
7 Difficulty Tiers
Each family is tested at Clean, Noisy, Adversarial, and Degenerate difficulty levels (plus 3 additional tiers). Synthetic test data is generated deterministically: no human-in-the-loop, no vibes, no prompt engineering.
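To make "deterministic" concrete, here is a toy sketch of seeded noise injection for a Noisy-tier input. It illustrates the idea only, not the actual CI generator; with the same seed, the corrupted output is identical on every run:

# Toy Noisy-tier generator: seed the PRNG, corrupt ~15% of characters.
# Same seed => byte-identical output across runs (gawk/mawk).
awk -v seed=42 'BEGIN {
  srand(seed)
  s = "Invoice Total: $1,204.50"
  for (i = 1; i <= length(s); i++) {
    c = substr(s, i, 1)
    if (rand() < 0.15) c = "#"
    printf "%s", c
  }
  print ""
}'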
26 Models Scored
OpenAI (GPT-4o, 4.1, o4-mini), Anthropic (Claude Haiku/Sonnet/Opus), Google (Gemini Flash/Pro), Mistral, DeepSeek, Groq (Llama), xAI (Grok), Cohere. Every result is deterministic and reproducible.
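"Deterministic and reproducible" follows from the grading step: answers are scored by exact field comparison, not by a judge model. A minimal sketch of that idea, assuming expected.json and extracted.json hold flat {"field": "value"} objects (the filenames and layout are assumptions):

# Toy scorer: fraction of fields where the model's extraction exactly
# matches ground truth. No judge model involved.
jq -n --slurpfile e expected.json --slurpfile x extracted.json '
  ($e[0] | to_entries) as $fields
  | ([ $fields[] | select(.value == $x[0][.key]) ] | length) / ($fields | length)
'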
Powered by Crystallized Intelligence (CI): deterministic synthetic benchmarks, 100x cheaper than LLM-graded evals. See it in production at BookPull.