Enable accuracy test for all models and add threshold checking #2234
gyohuangxin wants to merge 4 commits into main
Conversation
- Remove the condition that skipped the accuracy test for DeepSeek-R1-0528
- Add `accuracy_test_threshold` to the matrix config for each model
- Add a "Check accuracy test results" step to validate accuracy against the threshold
Pull request overview
Updates the ATOM CI workflow to run accuracy tests for all configured models and enforce per-model minimum accuracy thresholds as part of the job.
Changes:
- Removes the model-specific skip so the ATOM accuracy test runs for all models in the matrix.
- Adds `accuracy_test_threshold` to each model in the workflow matrix.
- Adds a post-step that reads the latest accuracy results JSON and fails the job if the configured metric is below the threshold.
```yaml
- name: Check accuracy test results
  if: (matrix.run_on_pr == true || github.event_name != 'pull_request') && success()
  run: |
    result_file=$(ls -1t accuracy_test_results/*.json 2>/dev/null | head -n 1)
    if [ -z "$result_file" ] || [ ! -f "$result_file" ]; then
      echo "ERROR: No results JSON file found in accuracy_test_results/"
      exit 2
    else
      echo "RESULT_FILE: $result_file"
    fi
    flexible_extract_value=$(jq '.results.gsm8k["exact_match,flexible-extract"]' "$result_file")
    echo "Flexible extract value: $flexible_extract_value"
    echo "Accuracy test threshold: ${{ matrix.accuracy_test_threshold }}"
    result=$(awk -v val="$flexible_extract_value" -v threshold="${{ matrix.accuracy_test_threshold }}" 'BEGIN {print (val < threshold) ? 1 : 0}')
```
This validation step doesn't enable strict bash options, so failures in `jq` (missing binary, JSON parse error, missing key) won't reliably fail the step at the point of error. Consider adding `set -euo pipefail` at the start, using `jq -e -r` (and/or `tonumber`) to ensure the metric exists and is numeric, and explicitly failing if `matrix.accuracy_test_threshold` is empty/unset to avoid accidental passes on future matrix additions.
```shell
echo "========== Running accuracy test =========="
docker exec atom_aiter_test bash -lc "
  .github/scripts/atom_test.sh accuracy $model_path
" 2>&1 | tee atom_accuracy_output.txt
```
`tee atom_accuracy_output.txt` creates a new log file, but it's never uploaded as an artifact (only `atom_test_output.txt` is). Either include `atom_accuracy_output.txt` (and/or the JSON results) in the upload step for easier debugging on failures, or remove the file write if it's not intended to be consumed.
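One way to pick up the extra log: extend the upload step's `path` list. This is a sketch assuming the workflow uses `actions/upload-artifact`; the artifact name and the `matrix.model` key are assumptions, not taken from the actual workflow:

```yaml
- name: Upload test logs
  if: always()                     # keep logs even when the job fails
  uses: actions/upload-artifact@v4
  with:
    name: atom-test-logs-${{ matrix.model }}   # hypothetical artifact name/key
    path: |
      atom_test_output.txt
      atom_accuracy_output.txt
      accuracy_test_results/*.json
```

`if: always()` matters here, since the logs are most useful precisely when the accuracy check fails the job.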
The accuracy test has exclusive GPU access; the previous 0.3 limit was too conservative and caused server timeouts under concurrent lm_eval requests.
Summary
- Add `accuracy_test_threshold` to matrix config for each model (DeepSeek-R1-0528: 0.94, gpt-oss-120b: 0.38), consistent with ATOM repo
- Validate the accuracy metric (`exact_match,flexible-extract`) against the configured threshold

Test plan
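The per-model thresholds above would sit in the job matrix roughly like this. A sketch only: the `model` key name and `include` layout are assumptions, and only `accuracy_test_threshold` and the two threshold values come from the PR:

```yaml
strategy:
  matrix:
    include:
      - model: DeepSeek-R1-0528        # assumed matrix key name
        accuracy_test_threshold: 0.94
      - model: gpt-oss-120b
        accuracy_test_threshold: 0.38
```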