
Conversation

@simonrosenberg (Collaborator) commented Dec 2, 2025

What does this PR do?

This PR adds the ability to select which benchmark to evaluate (SWE-bench or GAIA) through the workflow dispatch UI.

Changes

Workflow Configuration (run-eval.yml)

  • Add a benchmark input parameter with dropdown choices: swebench (default), gaia (see the sketch after this list)
  • Pass benchmark parameter to the evaluation workflow dispatch
  • Remove hardcoded DATASET and SPLIT environment variables (now set dynamically in evaluation repo)
  • Update dispatch logging to show selected benchmark
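
For reference, a minimal sketch of how such a dropdown is declared via workflow_dispatch inputs. The input name and choices match this PR; the rest of run-eval.yml is omitted and the description wording is an assumption:

# run-eval.yml (sketch): only the new dispatch input is shown
on:
  workflow_dispatch:
    inputs:
      benchmark:
        description: "Benchmark to evaluate"
        type: choice
        default: swebench
        options:
          - swebench
          - gaia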

Usage

When manually triggering the evaluation workflow:

  1. Go to Actions → Run Eval → Run workflow
  2. Select the desired benchmark from the dropdown (defaults to swebench)
  3. Configure other parameters as usual (model, eval limit, etc.)

The workflow will automatically:

  • Use the correct dataset for the selected benchmark
  • Run the appropriate inference script (swebench-infer or gaia-infer)
  • Run the appropriate evaluation method
  • Skip unnecessary steps (e.g., no Docker builds for GAIA; see the sketch after this list)
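
As a rough illustration of this benchmark-specific gating (a hedged sketch, not the actual evaluation workflow: the job name, step names, trigger wiring, and build command are assumptions; only swebench-infer and gaia-infer are taken from this PR):

# evaluation workflow (sketch): illustrative step gating only
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Docker images are only needed for SWE-bench, so this step is skipped for GAIA
      - name: Build SWE-bench Docker images
        if: ${{ inputs.benchmark == 'swebench' }}
        run: echo "build images here"  # placeholder for the real build command

      # Choose the inference entry point based on the selected benchmark
      - name: Run inference
        run: |
          if [ "${{ inputs.benchmark }}" = "gaia" ]; then
            gaia-infer
          else
            swebench-infer
          fi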

Related PRs

This is part of a multi-repo change to enable evaluation on multiple benchmarks.

Testing

The changes maintain backward compatibility:

  • Default behavior (SWE-bench) is unchanged
  • Existing workflows and triggers continue to work
  • GAIA support is opt-in through the dropdown selection

Fixes #1293



Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant   Architectures   Base Image
java      amd64, arm64    eclipse-temurin:17-jdk
python    amd64, arm64    nikolaik/python-nodejs:python3.12-nodejs22
golang    amd64, arm64    golang:1.21-bookworm

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:409ed81-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-409ed81-python \
  ghcr.io/openhands/agent-server:409ed81-python

All tags pushed for this build

ghcr.io/openhands/agent-server:409ed81-golang-amd64
ghcr.io/openhands/agent-server:409ed81-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:409ed81-golang-arm64
ghcr.io/openhands/agent-server:409ed81-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:409ed81-java-amd64
ghcr.io/openhands/agent-server:409ed81-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:409ed81-java-arm64
ghcr.io/openhands/agent-server:409ed81-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:409ed81-python-amd64
ghcr.io/openhands/agent-server:409ed81-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:409ed81-python-arm64
ghcr.io/openhands/agent-server:409ed81-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:409ed81-golang
ghcr.io/openhands/agent-server:409ed81-java
ghcr.io/openhands/agent-server:409ed81-python

About Multi-Architecture Support

  • Each variant tag (e.g., 409ed81-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 409ed81-python-amd64) are also available if needed

- Add benchmark input parameter with choices: swebench (default), gaia
- Pass benchmark parameter to evaluation workflow dispatch
- Remove hardcoded DATASET/SPLIT (now set dynamically in evaluation repo)
- Enables evaluation on multiple benchmarks through workflow dispatch UI

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai (bot) commented Dec 5, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Run Eval (7 failing runs)

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1294 at branch `openhands/multi-benchmark-eval-support`

Feel free to include any additional details that might help me get this PR into a better state.


Development

Successfully merging this pull request may close these issues.

Make evaluation eval workflow compatible with multiple benchmarks
