The Problem
Current AI benchmarks measure the wrong thing.
MMLU, HumanEval, SWE-bench — these evaluate a model's capability at a single step in isolation. That is a useful proxy when you are choosing between models for a chatbot. It is a poor proxy when you are building an agent that needs to complete a 40-step research task, orchestrate a multi-file refactor, or run a document synthesis pipeline without falling apart halfway through.
The practical reality of production agentic systems: a model that succeeds 95% of the time on a single step may complete only about 20% of tasks by step 30, because per-step errors compound. The degradation curve is not visible in any standard benchmark. It is only visible in production, which is an expensive place to discover it.
HorizonBench makes this curve visible before you ship.
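The 95%-to-20% figure is roughly what plain compounding predicts. The sketch below assumes that every step succeeds independently with the same probability. That is a simplification made here for illustration, not a claim about how HorizonBench models failure, and `task_success_rate` is a hypothetical helper name.

```python
# Assumption: each step succeeds independently with the same probability.
# Real agents may fail in correlated ways, so treat this as a rough intuition.

def task_success_rate(per_step_success: float, steps: int) -> float:
    """Probability that all `steps` consecutive steps succeed."""
    return per_step_success ** steps


if __name__ == "__main__":
    for k in (1, 5, 10, 20, 30, 50):
        print(f"k={k:>3}: {task_success_rate(0.95, k):.1%}")
```

Under this assumption, 0.95 raised to the 30th power is about 0.21, which is the kind of collapse a single-step score hides.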
What It Measures
HorizonBench evaluates AI agent reliability over long-horizon task sequences, not peak capability at a single step.
Four benchmark task families:
- Multi-file refactoring — coordinate changes across a codebase while maintaining consistency
- Data pipeline execution — run sequential data transformations without state corruption
- Document research synthesis — gather, reconcile, and summarize information across sources
- Constraint-based scheduling — satisfy interacting constraints across a growing problem space
Each family is run at increasing step counts k (5, 10, 20, 30, 50, 100+), and the success rate is recorded at each level. The output is a Reliability Decay Curve (RDC) for each model: a direct measurement of how quickly performance collapses as task length grows.
Three metrics that actually matter for agent selection:
- RDC (Reliability Decay Curve) — the full performance profile from k=5 to k=100+
- MOP (Meltdown Onset Point) — the step count at which a model's success rate drops below 50%
- GDS (Graceful Degradation Score) — whether the model fails hard or degrades gracefully
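As a rough illustration of how the MOP relates to the RDC, the sketch below treats a decay curve as a mapping from step count to success rate and returns the first measured step count that falls below the threshold. The function name, the dictionary layout, and the example numbers are all assumptions made for illustration; HorizonBench's own output format and any interpolation between measured points may differ.

```python
from typing import Optional


def meltdown_onset_point(rdc: dict[int, float], threshold: float = 0.5) -> Optional[int]:
    """Return the smallest measured step count whose success rate is below `threshold`.

    Returns None if the curve never drops below the threshold in the measured range.
    """
    for steps in sorted(rdc):
        if rdc[steps] < threshold:
            return steps
    return None


# Hypothetical curve, invented for illustration only (not a HorizonBench result).
example_rdc = {5: 0.92, 10: 0.81, 20: 0.63, 30: 0.44, 50: 0.21, 100: 0.05}

if __name__ == "__main__":
    print(meltdown_onset_point(example_rdc))  # 30
```

Returning None rather than a sentinel number keeps "never melted down within the tested range" distinct from any real step count.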
"The question is not which model is smartest. The question is which model stays reliable longest."
Architecture
HorizonBench is built on LiteLLM, which means it runs against any compatible model — Anthropic, OpenAI, Google, or local models via Ollama. The evaluation harness is written in Python and distributed as a CLI tool.
The evaluation loop is deliberately simple. Each task family produces a task instance at a given step count. The model is asked to complete the task. The result is scored against ground truth. Success rates are aggregated across iterations. The decay curve is computed.
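The loop described above can be sketched in a few lines. This is not the HorizonBench source: the toy task (summing one digit per step), the `make_flaky_model` stand-in, and every function name here are hypothetical, chosen only so the example runs without API keys. In a real harness the model callable would wrap an LLM call.

```python
import random
from typing import Callable

# Hypothetical stand-ins: a "task" is a list of digits to sum (one per step),
# and a "model" is any callable that returns an answer for a task.
Task = list[int]
Model = Callable[[Task], int]


def make_task(steps: int, rng: random.Random) -> Task:
    """Produce a task instance at the given step count."""
    return [rng.randint(1, 9) for _ in range(steps)]


def ground_truth(task: Task) -> int:
    """Reference answer the model's result is scored against."""
    return sum(task)


def decay_curve(model: Model, step_counts: list[int],
                iterations: int, seed: int = 0) -> dict[int, float]:
    """Aggregate success rates across iterations for each step count."""
    rng = random.Random(seed)
    curve: dict[int, float] = {}
    for steps in step_counts:
        successes = 0
        for _ in range(iterations):
            task = make_task(steps, rng)
            if model(task) == ground_truth(task):
                successes += 1
        curve[steps] = successes / iterations
    return curve


def make_flaky_model(slip_rate: float, seed: int = 1) -> Model:
    """Toy model that silently skips each step with probability `slip_rate`."""
    rng = random.Random(seed)

    def model(task: Task) -> int:
        return sum(value for value in task if rng.random() >= slip_rate)

    return model


if __name__ == "__main__":
    print(decay_curve(make_flaky_model(0.05), [5, 10, 20, 50], iterations=200))
```

Because every digit is at least 1, a single skipped step always produces a wrong total, so the toy model's success rate falls as the step count grows.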
# Run all task families against Claude Sonnet and GPT-4o
horizonbench run --models claude-sonnet-4-5,gpt-4o --steps 5,10,20,50
# Run a specific family, more iterations for statistical confidence
horizonbench run --families refactor --iterations 20
# Export results as an interactive leaderboard
horizonbench export --format html
The interactive leaderboard output lets you compare models side-by-side on their full decay curves — not just a single headline number.
Why This Matters
If you are building a production agentic system, the model selection decision is one of the highest-leverage choices you make. A model that scores 5 percentage points better on a standard benchmark might be 30 percentage points worse at the step counts your system actually operates at.
HorizonBench gives you the data to make that decision correctly. It is free, runs against any model you already have API access to, and produces results in an hour on a standard laptop.
Installation
Install free with a single command — no account, no signup, no license required.
After installation, run the interactive setup to configure your API keys:
horizonbench setup
HorizonBench supports API keys for Anthropic, OpenAI, and Google. You only need keys for the models you want to evaluate.