
Conversation

@simonrosenberg (Collaborator) commented Dec 2, 2025

What does this PR do?

This PR adds the ability to select which benchmark to evaluate (SWE-bench or GAIA) through the workflow dispatch UI.

Changes

Workflow Configuration (run-eval.yml)

  • Add a benchmark input parameter with dropdown choices: swebench (default), gaia (see the sketch after this list)
  • Pass benchmark parameter to the evaluation workflow dispatch
  • Remove hardcoded DATASET and SPLIT environment variables (now set dynamically in evaluation repo)
  • Update dispatch logging to show selected benchmark
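
For reference, a minimal sketch of how such a dropdown is declared via workflow_dispatch inputs. The input name and choices match this PR; the rest of run-eval.yml is omitted and the description wording is an assumption:

# run-eval.yml (sketch): only the new dispatch input is shown
on:
  workflow_dispatch:
    inputs:
      benchmark:
        description: "Benchmark to evaluate"
        type: choice
        default: swebench
        options:
          - swebench
          - gaia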

Usage

When manually triggering the evaluation workflow:

  1. Go to Actions → Run Eval → Run workflow
  2. Select the desired benchmark from the dropdown (defaults to swebench)
  3. Configure other parameters as usual (model, eval limit, etc.)

The workflow will automatically:

  • Use the correct dataset for the selected benchmark
  • Run the appropriate inference script (swebench-infer or gaia-infer)
  • Run the appropriate evaluation method
  • Skip unnecessary steps (e.g., no Docker builds for GAIA; see the sketch after this list)
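
As a rough illustration of this benchmark-specific gating (a hedged sketch, not the actual evaluation workflow: the job name, step names, trigger wiring, and build command are assumptions; only swebench-infer and gaia-infer are taken from this PR):

# evaluation workflow (sketch): illustrative step gating only
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Docker images are only needed for SWE-bench, so this step is skipped for GAIA
      - name: Build SWE-bench Docker images
        if: ${{ inputs.benchmark == 'swebench' }}
        run: echo "build images here"  # placeholder for the real build command

      # Choose the inference entry point based on the selected benchmark
      - name: Run inference
        run: |
          if [ "${{ inputs.benchmark }}" = "gaia" ]; then
            gaia-infer
          else
            swebench-infer
          fi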

Related PRs

This is part of a multi-repo change to enable evaluation on multiple benchmarks.

Testing

The changes maintain backward compatibility:

  • Default behavior (SWE-bench) is unchanged
  • Existing workflows and triggers continue to work
  • GAIA support is opt-in through the dropdown selection

Fixes #1293



Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant   Architectures   Base Image
java      amd64, arm64    eclipse-temurin:17-jdk
python    amd64, arm64    nikolaik/python-nodejs:python3.12-nodejs22
golang    amd64, arm64    golang:1.21-bookworm

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:409ed81-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-409ed81-python \
  ghcr.io/openhands/agent-server:409ed81-python

All tags pushed for this build

ghcr.io/openhands/agent-server:409ed81-golang-amd64
ghcr.io/openhands/agent-server:409ed81-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:409ed81-golang-arm64
ghcr.io/openhands/agent-server:409ed81-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:409ed81-java-amd64
ghcr.io/openhands/agent-server:409ed81-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:409ed81-java-arm64
ghcr.io/openhands/agent-server:409ed81-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:409ed81-python-amd64
ghcr.io/openhands/agent-server:409ed81-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:409ed81-python-arm64
ghcr.io/openhands/agent-server:409ed81-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:409ed81-golang
ghcr.io/openhands/agent-server:409ed81-java
ghcr.io/openhands/agent-server:409ed81-python

About Multi-Architecture Support

  • Each variant tag (e.g., 409ed81-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 409ed81-python-amd64) are also available if needed

- Add benchmark input parameter with choices: swebench (default), gaia
- Pass benchmark parameter to evaluation workflow dispatch
- Remove hardcoded DATASET/SPLIT (now set dynamically in evaluation repo)
- Enables evaluation on multiple benchmarks through workflow dispatch UI

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai (bot) commented Dec 5, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Run Eval (7 failing runs)

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1294 at branch `openhands/multi-benchmark-eval-support`

Feel free to include any additional details that might help me get this PR into a better state.


Development

Successfully merging this pull request may close these issues.

Make evaluation eval workflow compatible with multiple benchmarks
