52 commits
c7d1454
Add GAIA eval_infer for unified evaluation workflow
openhands-agent Dec 2, 2025
2f430c2
Add remote workspace support to GAIA evaluation
simonrosenberg Dec 3, 2025
58f299a
Fix GAIA evaluation to use workspace_type from command-line argument
simonrosenberg Dec 3, 2025
54c13a6
Add LocalWorkspace support for GAIA evaluations
simonrosenberg Dec 3, 2025
02fd3ba
Add 'local' to workspace type choices in argument parser
simonrosenberg Dec 3, 2025
7254a69
Add 'local' to workspace_type Literal in EvalMetadata model
simonrosenberg Dec 3, 2025
a7d0246
Add GAIA agent server image build and use remote workspace
simonrosenberg Dec 3, 2025
35b3985
Add compatibility inputs to GAIA build workflow
simonrosenberg Dec 3, 2025
2f93f40
Fix SWE-bench to use DockerDevWorkspace for base_image/target
simonrosenberg Dec 3, 2025
1a56035
Fix pyright type error in load_hf_dataset.py
simonrosenberg Dec 3, 2025
33a83f1
Simplify GAIA build workflow for single-image architecture
simonrosenberg Dec 3, 2025
b2e1e54
Rename workflow to singular: build-gaia-image.yml
simonrosenberg Dec 3, 2025
6822653
Fix YAML syntax errors in build-gaia-image.yml
openhands-agent Dec 4, 2025
f794aba
Fix pickle error in GAIA build by replacing lambda with regular function
openhands-agent Dec 4, 2025
f679baa
Add fallback Docker buildx setup when Blacksmith fails
openhands-agent Dec 4, 2025
0d61966
Always set up docker-container builder as fallback
openhands-agent Dec 4, 2025
ae4d2ff
[TEMPORARY] Disable Tavily requirement for GAIA testing
openhands-agent Dec 4, 2025
b0afa1f
Add format_report.py formatters for GAIA and SWE-bench
openhands-agent Dec 4, 2025
def87bc
Update formatters to read output.jsonl and report.json directly
openhands-agent Dec 4, 2025
9468988
Fix GAIA timeout issues by pre-installing MCP server
openhands-agent Dec 4, 2025
89b77c6
Add next steps documentation for MCP fix
openhands-agent Dec 4, 2025
0120ad6
Merge main into feature branch - resolve conflicts
openhands-agent Dec 4, 2025
4db1e81
Revert temporary Tavily disable - restore full functionality
openhands-agent Dec 4, 2025
9e13bb2
Add comprehensive workflow status documentation
openhands-agent Dec 4, 2025
13af333
Add MCP-enhanced image build to GAIA workflow
openhands-agent Dec 4, 2025
7170493
Add workflow run summary documentation
openhands-agent Dec 4, 2025
e814424
Refresh workflow cache - add descriptive comment
openhands-agent Dec 4, 2025
a153769
Remove redundant build-gaia-mcp-image.yml workflow
openhands-agent Dec 4, 2025
5749490
Force workflow cache refresh
openhands-agent Dec 4, 2025
e8ac276
Rename workflow to avoid GitHub Actions cache issue
openhands-agent Dec 4, 2025
b362d9d
Fix YAML syntax error: collapse multi-line Python code
openhands-agent Dec 4, 2025
60d6e17
Fix YAML syntax: replace heredoc with direct string assignment
openhands-agent Dec 4, 2025
cbe5f81
Fix YAML syntax: use jq for comment body to avoid multi-line string i…
openhands-agent Dec 4, 2025
51bd224
Add fallback Docker Buildx setup when Blacksmith fails
openhands-agent Dec 4, 2025
c5cc86c
Replace Blacksmith with standard Docker Buildx setup
openhands-agent Dec 4, 2025
6614af7
Fix GAIA evaluation: Use binary target instead of binary-minimal to i…
openhands-agent Dec 5, 2025
98e0965
Remove unnecessary documentation files
openhands-agent Dec 5, 2025
02f31bc
Remove unused code and fix workflow options
openhands-agent Dec 5, 2025
852b64c
Merge main into openhands/multi-benchmark-eval-support
openhands-agent Dec 5, 2025
7ee9082
Fix trailing whitespace in format_report.py files
openhands-agent Dec 5, 2025
e076657
Revert swt_bench/run_infer.py to main version - no functional changes…
openhands-agent Dec 5, 2025
bc65fbc
Remove outdated workflow comment
openhands-agent Dec 5, 2025
2af2bf8
Add docker workspace support to GAIA evaluation
openhands-agent Dec 5, 2025
1a812ea
Fix GAIA workspace: keep docker mode behavior same as main, only add …
openhands-agent Dec 6, 2025
4c8a9d6
Update SDK submodule to latest main (693c3261) to match evaluation ru…
openhands-agent Dec 6, 2025
f5d612f
Fix Browser action deserialization by using OpenHandsModel
openhands-agent Dec 6, 2025
233b79e
improve: enhance agent output extraction and ffmpeg installation
openhands-agent Dec 7, 2025
9533cf2
Fix critical logging bug: Failed instances now properly tracked and r…
openhands-agent Dec 7, 2025
b5cf5d9
[TEMPORARY] Hardcode 2 failed GAIA task IDs for debugging
openhands-agent Dec 7, 2025
b3ab6f5
Add flexible instance_ids parameter to GAIA evaluation
openhands-agent Dec 8, 2025
12da617
Fix error counting in GAIA evaluation
openhands-agent Dec 8, 2025
5bbb4d3
Fix pre-commit issues: formatting and type checking
openhands-agent Dec 8, 2025
@@ -9,14 +9,6 @@ on:
description: 'Software Agent SDK commit/ref to use'
required: true
type: string
target:
description: 'Build target (default: binary-minimal)'
required: false
default: 'binary-minimal'
type: choice
options:
- binary-minimal
- source-minimal

concurrency:
group: build-gaia-${{ github.ref }}
@@ -65,8 +57,8 @@ jobs:
git add vendor/software-agent-sdk
echo "Updated SDK submodule to $SDK_SHA (from ${{ inputs.sdk-commit }})"

- name: Set up Docker Buildx with Blacksmith
uses: useblacksmith/setup-docker-builder@v1
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3

- name: Log in to GitHub Container Registry
uses: docker/login-action@v3
@@ -88,7 +80,8 @@ jobs:
run: |
set -euo pipefail

TARGET="${{ inputs.target || 'binary-minimal' }}"
# GAIA requires 'binary' target to include Chromium for browser operations
TARGET="binary"

CMD="uv run benchmarks/gaia/build_images.py \
--image ghcr.io/openhands/eval-agent-server \
@@ -101,6 +94,39 @@ jobs:
DOCKER_BUILDKIT: 1
BUILDKIT_PROGRESS: plain

- name: Build and push GAIA image with MCP pre-installed
run: |
set -euo pipefail

# Get the SDK commit SHA for tagging
SDK_SHA=$(git submodule status vendor/software-agent-sdk | awk '{print $1}' | sed 's/^[+-]//' | cut -c1-7)

# GAIA requires 'binary' target to include Chromium for browser operations
TARGET="binary"

# Compute base and MCP image tags
BASE_IMAGE="ghcr.io/openhands/eval-agent-server:${SDK_SHA}-gaia"
MCP_IMAGE="ghcr.io/openhands/eval-agent-server:${SDK_SHA}-gaia-with-mcp"

echo "Building MCP-enhanced image..."
echo " Base image: ${BASE_IMAGE}"
echo " MCP image: ${MCP_IMAGE}"

# Build the derived image with MCP pre-cached
docker build \
-f benchmarks/gaia/Dockerfile.gaia \
--build-arg SDK_IMAGE="${BASE_IMAGE}" \
-t "${MCP_IMAGE}" \
.

# Push the image
docker push "${MCP_IMAGE}"

echo "✅ MCP-enhanced image built and pushed: ${MCP_IMAGE}"
env:
DOCKER_BUILDKIT: 1
BUILDKIT_PROGRESS: plain

- name: Archive build logs
if: always()
run: |
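Note on the MCP build step in the hunk above: its tag derivation can be checked outside the workflow with a small Python sketch. This is not part of the PR. The helper names `short_sdk_sha`, `gaia_image_tags`, and `docker_build_argv` are hypothetical, and the sample status line in the usage note uses a made-up SHA. The sketch also assumes, as the step itself does, that the preceding build step pushed a base tag of the form `<short-sha>-gaia`; the diff does not show how `build_all_images` composes that tag.

```python
import re

# Image name taken from the workflow; everything else here is illustrative.
REGISTRY_IMAGE = "ghcr.io/openhands/eval-agent-server"


def short_sdk_sha(status_line: str, length: int = 7) -> str:
    """Mirror `awk '{print $1}' | sed 's/^[+-]//' | cut -c1-7` on one status line."""
    fields = status_line.split()
    if not fields:
        raise ValueError("empty `git submodule status` line")
    sha = re.sub(r"^[+-]", "", fields[0])  # drop a single leading + or -
    return sha[:length]


def gaia_image_tags(sdk_sha_short: str) -> tuple[str, str]:
    """Return (base_image, mcp_image) the way the workflow step computes them."""
    base_image = f"{REGISTRY_IMAGE}:{sdk_sha_short}-gaia"
    return base_image, f"{base_image}-with-mcp"


def docker_build_argv(base_image: str, mcp_image: str) -> list[str]:
    """Assemble the same `docker build` invocation as an argument list."""
    return [
        "docker", "build",
        "-f", "benchmarks/gaia/Dockerfile.gaia",
        "--build-arg", f"SDK_IMAGE={base_image}",
        "-t", mcp_image,
        ".",
    ]
```

For example, `short_sdk_sha("+693c3261aaaa vendor/software-agent-sdk")` yields the 7-character prefix that both image tags share, so a mismatch between the base tag and the MCP tag can be ruled out before the `docker build` runs.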
@@ -157,6 +183,7 @@ jobs:
run: |
# Get SDK version
SDK_SHA=$(git submodule status vendor/software-agent-sdk | awk '{print $1}' | sed 's/^[+-]//')
SDK_SHA_SHORT=${SDK_SHA:0:7}

# Read the single manifest file
MANIFEST_FILE=$(find builds -name "manifest.jsonl" -type f 2>/dev/null | head -1 || true)
@@ -167,18 +194,16 @@ jobs:
fi

# Extract the image tag from the manifest
IMAGE_TAG=$(cat "$MANIFEST_FILE" | python3 -c "
import sys, json
data = json.loads(sys.stdin.read())
tags = data.get('tags', [])
print(tags[0] if tags else 'unknown')
")
IMAGE_TAG=$(cat "$MANIFEST_FILE" | python3 -c "import sys, json; data = json.loads(sys.stdin.read()); tags = data.get('tags', []); print(tags[0] if tags else 'unknown')")

if [ "$IMAGE_TAG" = "unknown" ]; then
echo "No valid image tag found in manifest"
exit 0
fi

# Construct MCP image tag (always binary for GAIA)
MCP_IMAGE_TAG="ghcr.io/openhands/eval-agent-server:${SDK_SHA_SHORT}-gaia-with-mcp"

# Determine trigger source
if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
TRIGGER="Manual trigger (workflow_dispatch)"
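Note on the manifest parsing in the hunk above: the inline `python3 -c` one-liner passes the whole file to `json.loads`, which works when `manifest.jsonl` holds a single JSON object but would raise if the file ever contained more than one record. A more tolerant reader could look like the sketch below. It is only an illustration, assuming each line is a JSON object with a `tags` list (the `tags` key comes from the workflow; the function name `first_image_tag` is hypothetical).

```python
import json


def first_image_tag(manifest_text: str) -> str:
    """Return the first tag of the first non-empty manifest line, else 'unknown'."""
    for line in manifest_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines before the first record
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            return "unknown"
        tags = record.get("tags", []) if isinstance(record, dict) else []
        return tags[0] if tags else "unknown"
    return "unknown"
```

Returning the same `'unknown'` sentinel keeps the existing `if [ "$IMAGE_TAG" = "unknown" ]` guard in the workflow meaningful for malformed or empty manifests.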
@@ -188,22 +213,23 @@ print(tags[0] if tags else 'unknown')
TRIGGER="${{ github.event_name }}"
fi

# Post comment
COMMENT_BODY=$(cat <<EOF
## GAIA Image Build Complete ✅

**SDK Version:** [\`${SDK_SHA:0:7}\`](https://github.com/OpenHands/software-agent-sdk/commit/${SDK_SHA})
**Image Tag:** \`${IMAGE_TAG}\`
**Workflow Run:** [#${{ github.run_id }}](${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }})
**Triggered by:** ${TRIGGER}
EOF
)

# Post comment using jq to properly handle multi-line content
jq -n \
--arg sdk_short "${SDK_SHA_SHORT}" \
--arg sdk_full "${SDK_SHA}" \
--arg image "${IMAGE_TAG}" \
--arg mcp_image "${MCP_IMAGE_TAG}" \
--arg run_id "${{ github.run_id }}" \
--arg server_url "${{ github.server_url }}" \
--arg repo "${{ github.repository }}" \
--arg trigger "${TRIGGER}" \
'{body: "## GAIA Image Build Complete ✅\n\n**SDK Version:** [`\($sdk_short)`](https://github.com/OpenHands/software-agent-sdk/commit/\($sdk_full))\n**Base Image:** `\($image)`\n**MCP Image:** `\($mcp_image)` ⚡ _(MCP server pre-cached)_\n**Workflow Run:** [#\($run_id)](\($server_url)/\($repo)/actions/runs/\($run_id))\n**Triggered by:** \($trigger)\n\nThe MCP-enhanced image includes pre-cached `mcp-server-fetch` to eliminate 1-18 minute startup delays."}' | \
curl -L -X POST \
-H "Accept: application/vnd.github+json" \
-H "Authorization: Bearer ${{ secrets.GITHUB_TOKEN }}" \
-H "X-GitHub-Api-Version: 2022-11-28" \
-H "Content-Type: application/json" \
"${{ github.api_url }}/repos/${{ github.repository }}/issues/81/comments" \
-d "$(jq -n --arg body "$COMMENT_BODY" '{body: $body}')"
-d @-
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
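Note on the comment-posting change above: the switch from a heredoc to `jq -n --arg ...` lets a JSON encoder escape newlines and backticks so they do not need to appear literally in the YAML. The same idea can be sketched in Python with `json.dumps`. This is an illustration only: the function name `build_comment_payload` is hypothetical, the body is abbreviated relative to the workflow's text, and any values passed in are placeholders.

```python
import json


def build_comment_payload(
    sdk_sha: str,
    image_tag: str,
    mcp_image_tag: str,
    run_url: str,
    run_id: str,
    trigger: str,
) -> str:
    """Build the JSON request body for an issue comment, one Markdown line per field."""
    sdk_short = sdk_sha[:7]
    lines = [
        "## GAIA Image Build Complete ✅",
        "",
        f"**SDK Version:** [`{sdk_short}`]"
        f"(https://github.com/OpenHands/software-agent-sdk/commit/{sdk_sha})",
        f"**Base Image:** `{image_tag}`",
        f"**MCP Image:** `{mcp_image_tag}` ⚡ _(MCP server pre-cached)_",
        f"**Workflow Run:** [#{run_id}]({run_url})",
        f"**Triggered by:** {trigger}",
    ]
    # json.dumps escapes the newlines, so the payload is a single line of JSON.
    return json.dumps({"body": "\n".join(lines)})
```

The returned string can be sent as the request body in the same way the workflow pipes the `jq` output into `curl -d @-`.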
17 changes: 17 additions & 0 deletions benchmarks/gaia/Dockerfile.gaia
@@ -0,0 +1,17 @@
# Dockerfile for GAIA evaluation with MCP server pre-installed
# Extends the base SDK image to pre-cache mcp-server-fetch and eliminate startup delays

ARG SDK_IMAGE=ghcr.io/openhands/eval-agent-server:f715937-gaia-binary-minimal
FROM ${SDK_IMAGE}

# Switch to root to install packages
USER root

# Pre-install MCP server to avoid 1-18 minute startup delays during agent initialization
# This caches the mcp-server-fetch package so uvx can start it instantly at runtime
RUN uvx mcp-server-fetch --version 2>&1 || echo "MCP server cached"

# Switch back to openhands user
USER openhands

# Inherit all other settings from base image
61 changes: 61 additions & 0 deletions benchmarks/gaia/build_images.py
@@ -0,0 +1,61 @@
#!/usr/bin/env python3
"""
Build a universal agent-server image for GAIA benchmark.

Unlike SWE-bench which requires per-instance images with specific repository environments,
GAIA uses a single universal image for all instances since they share the same Python+Node.js environment.

Example:
    uv run benchmarks/gaia/build_images.py \
        --image ghcr.io/openhands/eval-agent-server --target binary-minimal --push
"""

import sys

from benchmarks.utils.build_utils import (
    build_all_images,
    default_build_output_dir,
    get_build_parser,
)
from openhands.sdk import get_logger


logger = get_logger(__name__)

# GAIA base image: Python 3.12 + Node.js 22 (default for agent server)
GAIA_BASE_IMAGE = "nikolaik/python-nodejs:python3.12-nodejs22"


def gaia_tag_fn(base_image: str) -> str:
    """Return custom tag for GAIA images (all use 'gaia' tag)."""
    return "gaia"


def main(argv: list[str]) -> int:
    parser = get_build_parser()
    args = parser.parse_args(argv)

    # GAIA only needs one universal image for all instances
    base_images = [GAIA_BASE_IMAGE]

    logger.info(f"Building GAIA agent server image from base: {GAIA_BASE_IMAGE}")
    logger.info(f"Target: {args.target}")
    logger.info(f"Image: {args.image}")
    logger.info(f"Push: {args.push}")

    build_dir = default_build_output_dir("gaia", "validation")
    return build_all_images(
        base_images=base_images,
        target=args.target,
        build_dir=build_dir,
        image=args.image,
        push=args.push,
        max_workers=1,  # Only building one image
        dry_run=args.dry_run,
        max_retries=args.max_retries,
        base_image_to_custom_tag_fn=gaia_tag_fn,  # Tag all with "gaia"
    )


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
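Note on `gaia_tag_fn`: commit f794aba in the list above says a pickle error was fixed by replacing a lambda with a regular function. The diff does not show why the callback gets pickled; a plausible reason (an assumption, not confirmed by the source) is that `build_all_images` hands it to worker processes. The sketch below only demonstrates the general Python behaviour: `pickle` serializes functions by qualified name, so a lambda cannot be pickled. `is_picklable` and `lambda_tag_fn` are hypothetical names used for illustration.

```python
import pickle


def gaia_tag_fn(base_image: str) -> str:
    """Same shape as the PR's tag function: every GAIA image gets the 'gaia' tag."""
    return "gaia"


def is_picklable(obj: object) -> bool:
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# Behaves like gaia_tag_fn when called, but has no importable name for pickle.
lambda_tag_fn = lambda base_image: "gaia"  # noqa: E731
```

Here `is_picklable(lambda_tag_fn)` is false, whereas a function defined at the top level of an importable module, as `gaia_tag_fn` is in `build_images.py`, can normally be pickled by reference.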