Skip to content

Enable accuracy test for all models and add threshold checking#2234

Open
gyohuangxin wants to merge 4 commits intomainfrom
fix/atom-accuracy-test-threshold
Open

Enable accuracy test for all models and add threshold checking#2234
gyohuangxin wants to merge 4 commits intomainfrom
fix/atom-accuracy-test-threshold

Conversation

@gyohuangxin
Copy link
Member

Summary

  • Remove the condition that skipped accuracy test for DeepSeek-R1-0528, so accuracy test now runs for all models
  • Add accuracy_test_threshold to matrix config for each model (DeepSeek-R1-0528: 0.94, gpt-oss-120b: 0.38), consistent with ATOM repo
  • Add "Check accuracy test results" step to validate accuracy score (exact_match,flexible-extract) against the configured threshold

Test plan

  • Verify accuracy test runs for DeepSeek-R1-0528 (previously skipped)
  • Verify accuracy test runs for gpt-oss-120b (unchanged behavior)
  • Verify threshold check correctly fails when score is below threshold
  • Verify threshold check correctly passes when score meets or exceeds threshold

- Remove the condition that skipped accuracy test for DeepSeek-R1-0528
- Add accuracy_test_threshold to matrix config for each model
- Add "Check accuracy test results" step to validate accuracy against threshold
@gyohuangxin gyohuangxin requested review from a team and Copilot March 10, 2026 04:53
@github-actions
Copy link
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label Tests
ci:sglang SGLang integration tests
ci:atom ATOM benchmark (DeepSeek-R1 + GPT-OSS)
ci:multi-gpu Multi-GPU op tests (8 GPU)
ci:vllm vLLM benchmark
ci:all All of the above

Add labels via the sidebar or gh pr edit 2234 --add-label <label>

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the ATOM CI workflow to run accuracy tests for all configured models and enforce per-model minimum accuracy thresholds as part of the job.

Changes:

  • Removes the model-specific skip so the ATOM accuracy test runs for all models in the matrix.
  • Adds accuracy_test_threshold to each model in the workflow matrix.
  • Adds a post-step that reads the latest accuracy results JSON and fails the job if the configured metric is below the threshold.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +235 to +249
- name: Check accuracy test results
if: (matrix.run_on_pr == true || github.event_name != 'pull_request') && success()
run: |
result_file=$(ls -1t accuracy_test_results/*.json 2>/dev/null | head -n 1)
if [ -z "$result_file" ] || [ ! -f "$result_file" ]; then
echo "ERROR: No results JSON file found in accuracy_test_results/"
exit 2
else
echo "RESULT_FILE: $result_file"
fi
flexible_extract_value=$(jq '.results.gsm8k["exact_match,flexible-extract"]' "$result_file")
echo "Flexible extract value: $flexible_extract_value"
echo "Accuracy test threshold: ${{ matrix.accuracy_test_threshold }}"

result=$(awk -v val="$flexible_extract_value" -v threshold="${{ matrix.accuracy_test_threshold }}" 'BEGIN {print (val < threshold) ? 1 : 0}')
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This validation step doesn't enable strict bash options, so failures in jq (missing binary, JSON parse error, missing key) won't reliably fail the step at the point of error. Consider adding set -euo pipefail at the start, using jq -e -r (and/or tonumber) to ensure the metric exists and is numeric, and explicitly failing if matrix.accuracy_test_threshold is empty/unset to avoid accidental passes on future matrix additions.

Copilot uses AI. Check for mistakes.
Comment on lines 230 to +233
echo "========== Running accuracy test =========="
docker exec atom_aiter_test bash -lc "
.github/scripts/atom_test.sh accuracy $model_path
"
" 2>&1 | tee atom_accuracy_output.txt
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tee atom_accuracy_output.txt creates a new log file, but it's never uploaded as an artifact (only atom_test_output.txt is). Either include atom_accuracy_output.txt (and/or the JSON results) in the upload step for easier debugging on failures, or remove the file write if it's not intended to be consumed.

Copilot uses AI. Check for mistakes.
Accuracy test has exclusive GPU access, the previous 0.3 limit was
too conservative and caused server timeouts under concurrent lm_eval
requests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants