Add benchmark selection parameter to evaluation workflow #1294
What does this PR do?
This PR adds the ability to select which benchmark to evaluate (SWE-bench or GAIA) through the workflow dispatch UI.
Changes
Workflow Configuration (run-eval.yml)
• benchmark input parameter with dropdown choices: swebench (default), gaia
• benchmark parameter to the evaluation workflow dispatch
• DATASET and SPLIT environment variables (now set dynamically in evaluation repo)
Usage
When manually triggering the evaluation workflow:
swebench)
The workflow will automatically:
swebench-infer or gaia-infer)
Related PRs
This is part of a multi-repo change to enable evaluation on multiple benchmarks:
Testing
The changes maintain backward compatibility:
Fixes #1293
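For illustration, the benchmark dropdown described under Changes could be declared roughly as follows. This is a minimal sketch that assumes the input is defined under the workflow_dispatch trigger. Only the input name (benchmark), its choices (swebench, gaia), and the default (swebench) come from this PR description; the description string and the required setting are assumptions, and the actual run-eval.yml may differ.

```yaml
# Hypothetical excerpt of run-eval.yml, not the actual file contents.
on:
  workflow_dispatch:
    inputs:
      benchmark:
        # The description text is an assumption for illustration.
        description: Benchmark to evaluate
        type: choice
        required: false
        default: swebench
        options:
          - swebench
          - gaia
```

With a default set, a manual run that leaves the dropdown untouched selects swebench.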
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
• eclipse-temurin:17-jdk
• nikolaik/python-nodejs:python3.12-nodejs22
• golang:1.21-bookworm
Pull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:409ed81-python
Run
All tags pushed for this build
About Multi-Architecture Support
409ed81-python) is a multi-arch manifest supporting both amd64 and arm64
409ed81-python-amd64) are also available if needed