
RouterArena logo

Blog | arXiv: RouterArena | Hugging Face Dataset

Make Router Evaluation Open and Standardized

RouterArena Diagram

RouterArena is an open evaluation platform and leaderboard for LLM routers—systems that automatically select the best model for a given query. As the LLM ecosystem diversifies with models varying in size, capability, and cost, routing has become critical for balancing performance and cost. Yet, LLM routers currently lack a standardized evaluation framework to assess how effectively they trade off accuracy, cost, and other related metrics.

RouterArena bridges this gap by providing an open evaluation platform and benchmarking framework for both open-source and commercial routers. It has the following key features:

  • 🌍 Diverse Data Coverage: A principled, diverse evaluation dataset spanning 9 domains and 44 categories, with easy, medium, and hard difficulty levels.
  • 📊 Comprehensive Metrics: Five router-critical metrics measuring accuracy, cost, optimality, robustness, and latency.
  • ⚙️ Automated Evaluation: An automated evaluation framework to simplify the evaluation process for open-source and commercial routers.
  • 🏆 Live Leaderboard: A live leaderboard to track the performance of routers across multiple dimensions.

We aim for RouterArena to serve as a foundation for the community to evaluate, understand, and advance LLM routing systems.

Current Leaderboard

For more details, please see our website and blog.

Rank | Router | Affiliation | Acc-Cost Arena | Accuracy | Cost/1K Queries | Optimal Selection | Optimal Cost | Optimal Accuracy | Latency | Robustness
🥇 | MIRT‑BERT [GH] | 🎓 USTC | 66.89 | 66.88 | $0.15 | 3.44 | 19.62 | 78.18 | 27.03 | 61.19
🥈 | Azure‑Router [Web] | 💼 Microsoft | 66.66 | 68.09 | $0.54 | 22.52 | 46.32 | 81.96 | 54.07 | –
🥉 | NIRT‑BERT [GH] | 🎓 USTC | 66.12 | 66.34 | $0.21 | 3.83 | 14.04 | 77.88 | 10.42 | 49.29
4 | GPT‑5 | 💼 OpenAI | 64.32 | 73.96 | $10.02 | – | – | – | – | –
5 | vLLM‑SR [GH] [HF] | 🎓 vLLM SR Team | 64.32 | 67.28 | $1.67 | 4.79 | 12.54 | 79.33 | 0.19 | 35.00
6 | CARROT [GH] [HF] | 🎓 UMich | 63.87 | 67.21 | $2.06 | 2.68 | 6.77 | 78.63 | 1.50 | 89.05
7 | Chayan [HF] | 🎓 Adaptive Classifier | 63.83 | 64.89 | $0.56 | 43.03 | 43.75 | 88.74 | – | –
8 | RouterBench‑MLP [GH] [HF] | 🎓 Martian | 57.56 | 61.62 | $4.83 | 13.39 | 24.45 | 83.32 | 90.91 | 80.00
9 | NotDiamond | 💼 NotDiamond | 57.29 | 60.83 | $4.10 | 1.55 | 2.14 | 76.81 | 55.91 | –
10 | GraphRouter [GH] | 🎓 UIUC | 57.22 | 57.00 | $0.34 | 4.73 | 38.33 | 74.25 | 2.70 | 94.29
11 | RouterBench‑KNN [GH] [HF] | 🎓 Martian | 55.48 | 58.69 | $4.27 | 13.09 | 25.49 | 78.77 | 1.33 | 83.33
12 | RouteLLM [GH] [HF] | 🎓 Berkeley | 48.07 | 47.04 | $0.27 | 99.72 | 99.63 | 68.76 | 0.40 | 100.00
13 | RouterDC [GH] | 🎓 SUSTech | 33.75 | 32.01 | $0.07 | 39.84 | 73.00 | 49.05 | 10.75 | 85.24

🎓 Open-source  💼 Closed-source 

Evaluating Your Router

1. Setup

Step 1.1: Install uv and RouterArena

curl -LsSf https://astral.sh/uv/install.sh | sh
git clone https://github.com/RouteWorks/RouterArena.git
cd RouterArena
uv sync

Step 1.2: Download Dataset

Download the dataset from the Hugging Face dataset repository:

uv run python ./scripts/process_datasets/prep_datasets.py

Step 1.3: Set Up API Keys (Optional)

In the project root, copy .env.example to .env and fill in your API keys. This step is required only if you use our pipeline for LLM inference.

# Example .env file
OPENAI_API_KEY=<Your-Key>
ANTHROPIC_API_KEY=<Your-Key>
# ...

See the ModelInference class for the complete list of supported providers and required environment variables. You can extend that class to support more models, or submit a GitHub issue to request support for new providers.
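
If you want to sanity-check your keys before running inference, a small helper like the sketch below can be used. It is not part of RouterArena; it assumes the variables from .env have been exported into the environment (for example with set -a; source .env):

# check_keys.py - hypothetical helper, not part of RouterArena
import os

# Extend this list with the variables your providers need (see ModelInference).
REQUIRED_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]

missing = [key for key in REQUIRED_KEYS if not os.environ.get(key)]
if missing:
    raise SystemExit(f"Missing API keys: {', '.join(missing)}")
print("All required API keys are set.")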

2. Get Routing Decisions

Follow the steps below to obtain your router's model choice for each query. Start with the sub_10 split (a 10% subset) for local testing. Once your setup works, switch to the full split for complete local evaluation and official leaderboard submission.

Step 2.1: Prepare Config File

Create a config file at ./router_inference/config/<router_name>.json. An example is shown below:

{
  "pipeline_params": {
      "router_name": "your-router",
      "models": [
          "gpt-4o-mini",
          "claude-3-haiku-20240307",
          "gemini-2.0-flash-001"
      ]
  }
}

For each model in your config, add an entry to model_cost/cost.json with its price per million tokens, in this format:

{
  "gpt-4o-mini": {
    "input_token_price_per_million": 0.15,
    "output_token_price_per_million": 0.6
  }
}

Note

Ensure all models in the config files above are listed in ./universal_model_names.py. If you add a new model, you must also add its API inference endpoint in llm_inference/model_inference.py.

Step 2.2: Create Your Router Class and Generate Prediction File

Create your own router class by inheriting from BaseRouter and implementing the _get_prediction() method. See router_inference/router/example_router.py for a complete example.
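
As a rough illustration, a custom router might look like the sketch below. The import path and the _get_prediction() signature are assumptions on our part; treat example_router.py as the authoritative reference.

# my_router.py - illustrative sketch; mirror example_router.py for the real interface
from router_inference.router.base_router import BaseRouter  # import path assumed

class MyRouter(BaseRouter):
    # Toy policy: send short prompts to a cheap model, longer ones to another model.
    def _get_prediction(self, prompt: str) -> str:
        # Must return a model name listed in your config's "models" array.
        if len(prompt) < 200:
            return "gpt-4o-mini"
        return "claude-3-haiku-20240307"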

Then, modify router_inference/generate_prediction_file.py to use your router class:

# Replace ExampleRouter with your router class
from router_inference.router.my_router import MyRouter
router = MyRouter(args.router_name)

Finally, generate the prediction file:

uv run python ./router_inference/generate_prediction_file.py your-router [sub_10|full]

Note

  • The your-router argument must match your config filename (without the .json extension). For example, if your config file is router_inference/config/my-router.json, use my-router as the argument.
  • Your _get_prediction() method must return a model name that exists in your config file's models list. The base class will automatically validate this.

Step 2.3: Validate Config and Prediction Files

uv run python ./router_inference/check_config_prediction_files.py your-router [sub_10|full]

This script checks that: (1) all model names are valid, (2) the prediction file has the correct number of entries (809 for sub_10, 8400 for full), and (3) every entry has valid global_index, prompt, and prediction fields.
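
For orientation, a single entry in the prediction file might look roughly like the example below. The field names come from the checks above; the exact surrounding structure follows the repository's example prediction files, and the values here are purely illustrative:

{
  "global_index": 0,
  "prompt": "What is the capital of France?",
  "prediction": "gpt-4o-mini"
}

For leaderboard submissions, each entry additionally needs its generated_result field populated (see the submission section below).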

3. Run LLM Inference

Run the inference script to make API calls for each query using the selected models:

uv run python ./llm_inference/run.py your-router

The script loads your prediction file, makes API calls using the models specified in the prediction field, and saves results incrementally. It uses cached results when available and saves progress after each query, so you can safely interrupt and resume. Results are saved to ./cached_results/ for reuse across routers.
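
This follows a standard resume-safe caching pattern; the sketch below illustrates the idea only and is not the repository's actual implementation (the real cache layout under ./cached_results/ may differ):

# Illustrative resume-safe caching pattern, not RouterArena's actual code.
import json
from pathlib import Path

CACHE_DIR = Path("cached_results")  # directory name taken from the README
CACHE_DIR.mkdir(exist_ok=True)

def call_model(model: str, prompt: str) -> str:
    # Placeholder for the provider API call that llm_inference/run.py performs.
    raise NotImplementedError

def run_query(global_index: int, prompt: str, model: str) -> dict:
    # Reuse a cached result if present; otherwise call the model and persist the
    # result immediately, so an interrupted run can resume without repeating work.
    cache_file = CACHE_DIR / f"{model}_{global_index}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    result = {"model": model, "prompt": prompt, "output": call_model(model, prompt)}
    cache_file.write_text(json.dumps(result))
    return result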

4. Run Router Evaluation

As the last step, run the evaluation script:

uv run python ./llm_evaluation/run.py your-router [sub_10|full]

Submitting to the leaderboard

To get your router on the leaderboard, open a Pull Request with your router's prediction file to trigger our automated evaluation workflow. Details are as follows (a command-line sketch appears after the list):

  1. Add your files:
    • router_inference/config/<router_name>.json - Your router configuration
    • router_inference/predictions/<router_name>.json - Your prediction file with generated_result fields populated
  2. Open a Pull Request against the main branch - The automated workflow will:
    • Validate your submission
    • Run evaluation on the full dataset
    • Post results as a comment on your PR
    • Update the leaderboard upon approval
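
For reference, a typical submission from a fork of the repository boils down to commands like the following; the branch name and commit message are illustrative:

git checkout -b add-my-router
git add router_inference/config/my-router.json router_inference/predictions/my-router.json
git commit -m "Add my-router submission"
git push origin add-my-router
# Then open a Pull Request against main on GitHub.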

The figure below shows the evaluation pipeline.

RouterArena Evaluation Pipeline

Contributing

We welcome and appreciate contributions and collaborations of any kind.

We use pre-commit to ensure a consistent coding style. You can set it up with:

pip install pre-commit
pre-commit install

Before pushing, run the following and make sure all checks pass:

pre-commit run --all-files

Contacts

Feel free to contact us for contributions and collaborations.

Yifan Lu (yifan.lu@rice.edu)
Jiarong Xing (jxing@rice.edu)

Citation

If you find our project helpful, please give us a star and cite us by:

@misc{lu2025routerarenaopenplatformcomprehensive,
  title        = {RouterArena: An Open Platform for Comprehensive Comparison of LLM Routers},
  author       = {Yifan Lu and Rixin Liu and Jiayi Yuan and Xingqi Cui and Shenrun Zhang and Hongyi Liu and Jiarong Xing},
  year         = {2025},
  eprint       = {2510.00202},
  archivePrefix= {arXiv},
  primaryClass = {cs.LG},
  url          = {https://arxiv.org/abs/2510.00202}
}
