RouterArena is an open evaluation platform and leaderboard for LLM routers—systems that automatically select the best model for a given query. As the LLM ecosystem diversifies with models varying in size, capability, and cost, routing has become critical for balancing performance and cost. Yet, LLM routers currently lack a standardized evaluation framework to assess how effectively they trade off accuracy, cost, and other related metrics.
RouterArena bridges this gap with a standardized evaluation and benchmarking framework for both open-source and commercial routers. It has the following key features:
- 🌍 Diverse Data Coverage: A principled, diverse evaluation dataset spanning 9 domains and 44 categories with easy, medium, and hard difficulty levels.
- 📊 Comprehensive Metrics: Five router-critical metrics measuring accuracy, cost, optimality, robustness, and latency.
- ⚙️ Automated Evaluation: An automated evaluation framework to simplify the evaluation process for open-source and commercial routers.
- 🏆 Live Leaderboard: A live leaderboard to track the performance of routers across multiple dimensions.
We aim for RouterArena to serve as a foundation for the community to evaluate, understand, and advance LLM routing systems.
For more details, please see our website and blog.
| Rank | Router | Affiliation | Acc-Cost Arena | Accuracy | Cost/1K Queries | Optimal Selection | Optimal Cost | Optimal Accuracy | Latency | Robustness |
|---|---|---|---|---|---|---|---|---|---|---|
| 🥇 | MIRT‑BERT [GH] | 🎓 USTC | 66.89 | 66.88 | $0.15 | 3.44 | 19.62 | 78.18 | 27.03 | 61.19 |
| 🥈 | Azure‑Router [Web] | 💼 Microsoft | 66.66 | 68.09 | $0.54 | 22.52 | 46.32 | 81.96 | — | 54.07 |
| 🥉 | NIRT‑BERT [GH] | 🎓 USTC | 66.12 | 66.34 | $0.21 | 3.83 | 14.04 | 77.88 | 10.42 | 49.29 |
| 4 | GPT‑5 | 💼 OpenAI | 64.32 | 73.96 | $10.02 | — | — | — | — | — |
| 5 | vLLM‑SR [GH] [HF] | 🎓 vLLM SR Team | 64.32 | 67.28 | $1.67 | 4.79 | 12.54 | 79.33 | 0.19 | 35.00 |
| 6 | CARROT [GH] [HF] | 🎓 UMich | 63.87 | 67.21 | $2.06 | 2.68 | 6.77 | 78.63 | 1.50 | 89.05 |
| 7 | Chayan [HF] | 🎓 Adaptive Classifier | 63.83 | 64.89 | $0.56 | 43.03 | 43.75 | 88.74 | — | — |
| 8 | RouterBench‑MLP [GH] [HF] | 🎓 Martian | 57.56 | 61.62 | $4.83 | 13.39 | 24.45 | 83.32 | 90.91 | 80.00 |
| 9 | NotDiamond | 💼 NotDiamond | 57.29 | 60.83 | $4.10 | 1.55 | 2.14 | 76.81 | — | 55.91 |
| 10 | GraphRouter [GH] | 🎓 UIUC | 57.22 | 57.00 | $0.34 | 4.73 | 38.33 | 74.25 | 2.70 | 94.29 |
| 11 | RouterBench‑KNN [GH] [HF] | 🎓 Martian | 55.48 | 58.69 | $4.27 | 13.09 | 25.49 | 78.77 | 1.33 | 83.33 |
| 12 | RouteLLM [GH] [HF] | 🎓 Berkeley | 48.07 | 47.04 | $0.27 | 99.72 | 99.63 | 68.76 | 0.40 | 100.00 |
| 13 | RouterDC [GH] | 🎓 SUSTech | 33.75 | 32.01 | $0.07 | 39.84 | 73.00 | 49.05 | 10.75 | 85.24 |
🎓 Open-source 💼 Closed-source
Install uv and sync the project dependencies:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
cd RouterArena
uv sync
```

Download and prepare the dataset from the HF dataset:

```bash
uv run python ./scripts/process_datasets/prep_datasets.py
```

In the project root, copy `.env.example` to `.env` and update the API keys in `.env`. This step is required only if you use our pipeline for LLM inference.
```bash
# Example .env file
OPENAI_API_KEY=<Your-Key>
ANTHROPIC_API_KEY=<Your-Key>
# ...
```

See the `ModelInference` class for the complete list of supported providers and required environment variables. You can extend that class to support more models, or submit a GitHub issue to request support for new providers.
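As an optional sanity check, you can confirm the keys you plan to use are visible to Python before running inference. This is an illustrative sketch only, and it assumes the variables from `.env` have already been exported into your environment (by your shell or by the pipeline):

```python
# Optional sanity check (illustrative only): verify the provider keys are set
# before kicking off inference.
import os

required = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]  # adjust to the providers you use
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing API keys in the environment: {missing}")
```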
Follow the steps below to obtain your router's model choices for each query. Start with the sub_10 split (a 10% subset) for local testing. Once your setup works, run on the full dataset for complete local evaluation and official leaderboard submission.
Create a config file in ./router_inference/config/<router_name>.json. An example config file is included here.
```json
{
  "pipeline_params": {
    "router_name": "your-router",
    "models": [
      "gpt-4o-mini",
      "claude-3-haiku-20240307",
      "gemini-2.0-flash-001"
    ]
  }
}
```

For each model in your config, add an entry to `model_cost/cost.json` with its pricing per million tokens in this format:
```json
{
  "gpt-4o-mini": {
    "input_token_price_per_million": 0.15,
    "output_token_price_per_million": 0.6
  }
}
```

> [!NOTE]
> Ensure that all models in the config files above are listed in `./universal_model_names.py`. If you add a new model, you must also add its API inference endpoint in `llm_inference/model_inference.py`.
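To make the pricing units concrete, here is a small illustrative calculation; the token counts are made up, and RouterArena's own cost accounting may differ in detail:

```python
# Illustrative only: per-query cost under the per-million-token prices above.
input_price = 0.15    # gpt-4o-mini input price per million tokens (from cost.json)
output_price = 0.6    # gpt-4o-mini output price per million tokens (from cost.json)

input_tokens, output_tokens = 1_200, 400   # hypothetical token counts for one query
cost = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
print(f"${cost:.5f}")  # -> $0.00042
```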
Create your own router class by inheriting from BaseRouter and implementing the _get_prediction() method. See router_inference/router/example_router.py for a complete example.
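For illustration, a minimal router subclass might look like the sketch below. The `BaseRouter` import path and the exact `_get_prediction()` signature are assumptions here; treat `router_inference/router/example_router.py` as the authoritative reference.

```python
# Hypothetical sketch: the import path and method signature are assumptions.
from router_inference.router.base_router import BaseRouter


class MyRouter(BaseRouter):
    """Toy heuristic: send short prompts to a cheap model, longer ones to a stronger one."""

    def _get_prediction(self, prompt: str) -> str:
        # Must return a model name that appears in the config file's "models" list.
        if len(prompt) < 500:
            return "gpt-4o-mini"
        return "claude-3-haiku-20240307"
```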
Then, modify router_inference/generate_prediction_file.py to use your router class:
```python
# Replace ExampleRouter with your router class
from router_inference.router.my_router import MyRouter

router = MyRouter(args.router_name)
```

Finally, generate the prediction file:
```bash
uv run python ./router_inference/generate_prediction_file.py your-router [sub_10|full]
```

> [!NOTE]
> - The `<your-router>` argument must match your config filename (without the `.json` extension). For example, if your config file is `router_inference/config/my-router.json`, use `my-router` as the argument.
> - Your `_get_prediction()` method must return a model name that exists in your config file's `models` list. The base class will automatically validate this.
```bash
uv run python ./router_inference/check_config_prediction_files.py your-router [sub_10|full]
```

This script checks that (1) all model names are valid, (2) the prediction file has the correct size (809 for sub_10, 8400 for full), and (3) all entries have valid `global_index`, `prompt`, and `prediction` fields.
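For reference, a single entry that passes these checks would look roughly like the following. This is a hypothetical illustration; the exact surrounding structure of the file is whatever `generate_prediction_file.py` emits.

```json
{
  "global_index": 0,
  "prompt": "What is the capital of France?",
  "prediction": "gpt-4o-mini"
}
```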
Run the inference script to make API calls for each query using the selected models:
```bash
uv run python ./llm_inference/run.py your-router
```

The script loads your prediction file, makes API calls using the models specified in the `prediction` field, and saves results incrementally. It uses cached results when available and saves progress after each query, so you can safely interrupt and resume. Results are saved to `./cached_results/` for reuse across routers.
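The interrupt-and-resume behavior boils down to a simple pattern: look up each query in the cache, call the API only on a miss, and persist immediately. Below is a generic sketch of that pattern, not the actual `run.py` code; the cache file name is made up.

```python
# Generic sketch of the cache-and-resume pattern described above (not the actual run.py code).
import json
from pathlib import Path

CACHE_FILE = Path("./cached_results/example_cache.json")  # hypothetical file name


def cached_call(key: str, make_api_call):
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    if key in cache:                      # reuse earlier results after an interruption
        return cache[key]
    cache[key] = make_api_call()          # only uncached queries cost API calls
    CACHE_FILE.parent.mkdir(parents=True, exist_ok=True)
    CACHE_FILE.write_text(json.dumps(cache))  # persist right away so progress survives Ctrl-C
    return cache[key]
```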
As the last step, run the evaluation script:
```bash
uv run python ./llm_evaluation/run.py your-router [sub_10|full]
```

To get your router on the leaderboard, you can open a Pull Request with your router's prediction file to trigger our automated evaluation workflow. Details are as follows:
- Add your files:
  - `router_inference/config/<router_name>.json` - Your router configuration
  - `router_inference/predictions/<router_name>.json` - Your prediction file with `generated_result` fields populated
- Open a Pull Request to the `main` branch
- The automated workflow will:
  - Validate your submission
  - Run evaluation on the full dataset
  - Post results as a comment on your PR
  - Update the leaderboard upon approval
The Figure below shows the evaluation pipeline.
We welcome and appreciate contributions and collaborations of any kind.
We use pre-commit to ensure a consistent coding style. You can set it up by running:

```bash
pip install pre-commit
pre-commit install
```

Before pushing your code, run the following and make sure your code passes all checks:

```bash
pre-commit run --all-files
```

Feel free to contact us for contributions and collaborations.
Yifan Lu (yifan.lu@rice.edu)
Jiarong Xing (jxing@rice.edu)
If you find our project helpful, please give us a star and cite us by:
```bibtex
@misc{lu2025routerarenaopenplatformcomprehensive,
  title         = {RouterArena: An Open Platform for Comprehensive Comparison of LLM Routers},
  author        = {Yifan Lu and Rixin Liu and Jiayi Yuan and Xingqi Cui and Shenrun Zhang and Hongyi Liu and Jiarong Xing},
  year          = {2025},
  eprint        = {2510.00202},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2510.00202}
}
```


