RouterArena is an open evaluation platform and leaderboard for LLM routers—systems that automatically select the best model for a given query. As the LLM ecosystem diversifies with models varying in size, capability, and cost, routing has become critical for balancing performance and cost. Yet, LLM routers currently lack a standardized evaluation framework to assess how effectively they trade off accuracy, cost, and other related metrics.
RouterArena bridges this gap with a standardized evaluation and benchmarking framework for both open-source and commercial routers. It has the following key features:
- 🌍 Diverse Data Coverage: A principled, diverse evaluation dataset spanning 9 domains and 44 categories with easy, medium, and hard difficulty levels.
- 📊 Comprehensive Metrics: Five router-critical metrics measuring accuracy, cost, optimality, robustness, and latency.
- ⚙️ Automated Evaluation: An automated evaluation framework to simplify the evaluation process for open-source and commercial routers.
- 🏆 Live Leaderboard: A live leaderboard to track the performance of routers across multiple dimensions.
We aim for RouterArena to serve as a foundation for the community to evaluate, understand, and advance LLM routing systems.
For more details, please see our website and blog.
| Rank | Router | Affiliation | Acc-Cost Arena | Accuracy | Cost/1K Queries | Optimal Selection | Optimal Cost | Optimal Accuracy | Latency | Robustness |
|---|---|---|---|---|---|---|---|---|---|---|
| 🥇 | MIRT‑BERT [GH] | 🎓 USTC | 66.89 | 66.88 | $0.15 | 3.44 | 19.62 | 78.18 | 27.03 | 61.19 |
| 🥈 | Azure‑Router [Web] | 💼 Microsoft | 66.66 | 68.09 | $0.54 | 22.52 | 46.32 | 81.96 | — | 54.07 |
| 🥉 | NIRT‑BERT [GH] | 🎓 USTC | 66.12 | 66.34 | $0.21 | 3.83 | 14.04 | 77.88 | 10.42 | 49.29 |
| 4 | GPT‑5 | 💼 OpenAI | 64.32 | 73.96 | $10.02 | — | — | — | — | — |
| 5 | vLLM‑SR [GH] [HF] | 🎓 vLLM SR Team | 64.32 | 67.28 | $1.67 | 4.79 | 12.54 | 79.33 | 0.19 | 35.00 |
| 6 | CARROT [GH] [HF] | 🎓 UMich | 63.87 | 67.21 | $2.06 | 2.68 | 6.77 | 78.63 | 1.50 | 89.05 |
| 7 | Chayan [HF] | 🎓 Adaptive Classifier | 63.83 | 64.89 | $0.56 | 43.03 | 43.75 | 88.74 | — | — |
| 8 | RouterBench‑MLP [GH] [HF] | 🎓 Martian | 57.56 | 61.62 | $4.83 | 13.39 | 24.45 | 83.32 | 90.91 | 80.00 |
| 9 | NotDiamond | 💼 NotDiamond | 57.29 | 60.83 | $4.10 | 1.55 | 2.14 | 76.81 | — | 55.91 |
| 10 | GraphRouter [GH] | 🎓 UIUC | 57.22 | 57.00 | $0.34 | 4.73 | 38.33 | 74.25 | 2.70 | 94.29 |
| 11 | RouterBench‑KNN [GH] [HF] | 🎓 Martian | 55.48 | 58.69 | $4.27 | 13.09 | 25.49 | 78.77 | 1.33 | 83.33 |
| 12 | RouteLLM [GH] [HF] | 🎓 Berkeley | 48.07 | 47.04 | $0.27 | 99.72 | 99.63 | 68.76 | 0.40 | 100.00 |
| 13 | RouterDC [GH] | 🎓 SUSTech | 33.75 | 32.01 | $0.07 | 39.84 | 73.00 | 49.05 | 10.75 | 85.24 |
🎓 Open-source 💼 Closed-source
Install uv and sync the project dependencies:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
cd RouterArena
uv sync
```

Download and prepare the dataset from the HF dataset:

```bash
uv run python ./scripts/process_datasets/prep_datasets.py
```

In the project root, copy `.env.example` to `.env` and update the API keys in `.env`. This step is required only if you use our pipeline for LLM inference.
```bash
# Example .env file
OPENAI_API_KEY=<Your-Key>
ANTHROPIC_API_KEY=<Your-Key>
# ...
```

See the `ModelInference` class for the complete list of supported providers and required environment variables. You can extend that class to support more models, or submit a GitHub issue to request support for new providers.
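As an optional sanity check, you can confirm the keys you plan to use are visible to Python before running inference. This is an illustrative sketch only, and it assumes the variables from `.env` have already been exported into your environment (by your shell or by the pipeline):

```python
# Optional sanity check (illustrative only): verify the provider keys are set
# before kicking off inference.
import os

required = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]  # adjust to the providers you use
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing API keys in the environment: {missing}")
```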
Follow the steps below to obtain your router's model choices for each query. Start with the sub_10 split (a 10% subset) for local testing. Once your setup works, run on the full dataset for complete local evaluation and official leaderboard submission.
Create a config file in ./router_inference/config/<router_name>.json. An example config file is included here.
```json
{
  "pipeline_params": {
    "router_name": "your-router",
    "models": [
      "gpt-4o-mini",
      "claude-3-haiku-20240307",
      "gemini-2.0-flash-001"
    ]
  }
}
```

For each model in your config, add an entry to `model_cost/cost.json` with its pricing per million tokens in this format:
```json
{
  "gpt-4o-mini": {
    "input_token_price_per_million": 0.15,
    "output_token_price_per_million": 0.6
  }
}
```

> [!NOTE]
> Ensure that all models in the config files above are listed in `./universal_model_names.py`. If you add a new model, you must also add its API inference endpoint in `llm_inference/model_inference.py`.
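To make the pricing units concrete, here is a small illustrative calculation; the token counts are made up, and RouterArena's own cost accounting may differ in detail:

```python
# Illustrative only: per-query cost under the per-million-token prices above.
input_price = 0.15    # gpt-4o-mini input price per million tokens (from cost.json)
output_price = 0.6    # gpt-4o-mini output price per million tokens (from cost.json)

input_tokens, output_tokens = 1_200, 400   # hypothetical token counts for one query
cost = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
print(f"${cost:.5f}")  # -> $0.00042
```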
Create your own router class by inheriting from BaseRouter and implementing the _get_prediction() method. See router_inference/router/example_router.py for a complete example.
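For illustration, a minimal router subclass might look like the sketch below. The `BaseRouter` import path and the exact `_get_prediction()` signature are assumptions here; treat `router_inference/router/example_router.py` as the authoritative reference.

```python
# Hypothetical sketch: the import path and method signature are assumptions.
from router_inference.router.base_router import BaseRouter


class MyRouter(BaseRouter):
    """Toy heuristic: send short prompts to a cheap model, longer ones to a stronger one."""

    def _get_prediction(self, prompt: str) -> str:
        # Must return a model name that appears in the config file's "models" list.
        if len(prompt) < 500:
            return "gpt-4o-mini"
        return "claude-3-haiku-20240307"
```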
Then, modify router_inference/generate_prediction_file.py to use your router class:
```python
# Replace ExampleRouter with your router class
from router_inference.router.my_router import MyRouter

router = MyRouter(args.router_name)
```

Finally, generate the prediction file:
```bash
uv run python ./router_inference/generate_prediction_file.py your-router [sub_10|full]
```

> [!NOTE]
> - The `<your-router>` argument must match your config filename (without the `.json` extension). For example, if your config file is `router_inference/config/my-router.json`, use `my-router` as the argument.
> - Your `_get_prediction()` method must return a model name that exists in your config file's `models` list. The base class will automatically validate this.
```bash
uv run python ./router_inference/check_config_prediction_files.py your-router [sub_10|full]
```

This script checks that (1) all model names are valid, (2) the prediction file has the correct size (809 for sub_10, 8400 for full), and (3) all entries have valid `global_index`, `prompt`, and `prediction` fields.
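For reference, a single entry that passes these checks would look roughly like the following. This is a hypothetical illustration; the exact surrounding structure of the file is whatever `generate_prediction_file.py` emits.

```json
{
  "global_index": 0,
  "prompt": "What is the capital of France?",
  "prediction": "gpt-4o-mini"
}
```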
Run the inference script to make API calls for each query using the selected models:
```bash
uv run python ./llm_inference/run.py your-router
```

The script loads your prediction file, makes API calls using the models specified in the `prediction` field, and saves results incrementally. It uses cached results when available and saves progress after each query, so you can safely interrupt and resume. Results are saved to `./cached_results/` for reuse across routers.
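The interrupt-and-resume behavior boils down to a simple pattern: look up each query in the cache, call the API only on a miss, and persist immediately. Below is a generic sketch of that pattern, not the actual `run.py` code; the cache file name is made up.

```python
# Generic sketch of the cache-and-resume pattern described above (not the actual run.py code).
import json
from pathlib import Path

CACHE_FILE = Path("./cached_results/example_cache.json")  # hypothetical file name


def cached_call(key: str, make_api_call):
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    if key in cache:                      # reuse earlier results after an interruption
        return cache[key]
    cache[key] = make_api_call()          # only uncached queries cost API calls
    CACHE_FILE.parent.mkdir(parents=True, exist_ok=True)
    CACHE_FILE.write_text(json.dumps(cache))  # persist right away so progress survives Ctrl-C
    return cache[key]
```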
As the last step, run the evaluation script:
```bash
uv run python ./llm_evaluation/run.py your-router [sub_10|full]
```

To get your router on the leaderboard, you can open a Pull Request with your router's prediction file to trigger our automated evaluation workflow. Details are as follows:
- Add your files:
  - `router_inference/config/<router_name>.json` - Your router configuration
  - `router_inference/predictions/<router_name>.json` - Your prediction file with `generated_result` fields populated
- Open a Pull Request to the `main` branch
- The automated workflow will:
  - Validate your submission
  - Run evaluation on the full dataset
  - Post results as a comment on your PR
  - Update the leaderboard upon approval
The Figure below shows the evaluation pipeline.
We welcome and appreciate contributions and collaborations of any kind.
We use pre-commit to ensure a consistent coding style. You can set it up by running:

```bash
pip install pre-commit
pre-commit install
```

Before pushing your code, run the following and make sure your code passes all checks:

```bash
pre-commit run --all-files
```

Feel free to contact us for contributions and collaborations.
Yifan Lu (yifan.lu@rice.edu)
Jiarong Xing (jxing@rice.edu)
If you find our project helpful, please give us a star and cite us by:
```bibtex
@misc{lu2025routerarenaopenplatformcomprehensive,
  title         = {RouterArena: An Open Platform for Comprehensive Comparison of LLM Routers},
  author        = {Yifan Lu and Rixin Liu and Jiayi Yuan and Xingqi Cui and Shenrun Zhang and Hongyi Liu and Jiarong Xing},
  year          = {2025},
  eprint        = {2510.00202},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2510.00202}
}
```


