Enable accuracy test for all models and add threshold checking #2234
gyohuangxin wants to merge 4 commits into main
Conversation
- Remove the condition that skipped the accuracy test for DeepSeek-R1-0528
- Add `accuracy_test_threshold` to the matrix config for each model
- Add a "Check accuracy test results" step to validate accuracy against the threshold
Pull request overview
Updates the ATOM CI workflow to run accuracy tests for all configured models and enforce per-model minimum accuracy thresholds as part of the job.
Changes:
- Removes the model-specific skip so the ATOM accuracy test runs for all models in the matrix.
- Adds `accuracy_test_threshold` to each model in the workflow matrix.
- Adds a post-step that reads the latest accuracy results JSON and fails the job if the configured metric is below the threshold.
```yaml
- name: Check accuracy test results
  if: (matrix.run_on_pr == true || github.event_name != 'pull_request') && success()
  run: |
    result_file=$(ls -1t accuracy_test_results/*.json 2>/dev/null | head -n 1)
    if [ -z "$result_file" ] || [ ! -f "$result_file" ]; then
      echo "ERROR: No results JSON file found in accuracy_test_results/"
      exit 2
    else
      echo "RESULT_FILE: $result_file"
    fi
    flexible_extract_value=$(jq '.results.gsm8k["exact_match,flexible-extract"]' "$result_file")
    echo "Flexible extract value: $flexible_extract_value"
    echo "Accuracy test threshold: ${{ matrix.accuracy_test_threshold }}"
    result=$(awk -v val="$flexible_extract_value" -v threshold="${{ matrix.accuracy_test_threshold }}" 'BEGIN {print (val < threshold) ? 1 : 0}')
```
This validation step doesn't enable strict bash options, so failures in `jq` (missing binary, JSON parse error, missing key) won't reliably fail the step at the point of error. Consider adding `set -euo pipefail` at the start, using `jq -e -r` (and/or `tonumber`) to ensure the metric exists and is numeric, and explicitly failing if `matrix.accuracy_test_threshold` is empty/unset to avoid accidental passes on future matrix additions.
```shell
echo "========== Running accuracy test =========="
docker exec atom_aiter_test bash -lc "
  .github/scripts/atom_test.sh accuracy $model_path
" 2>&1 | tee atom_accuracy_output.txt
```
`tee atom_accuracy_output.txt` creates a new log file, but it's never uploaded as an artifact (only `atom_test_output.txt` is). Either include `atom_accuracy_output.txt` (and/or the JSON results) in the upload step for easier debugging on failures, or remove the file write if it's not intended to be consumed.
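One way to pick up the extra log: extend the upload step's `path` list. This is a sketch assuming the workflow uses `actions/upload-artifact`; the artifact name and the `matrix.model` key are assumptions, not taken from the actual workflow:

```yaml
- name: Upload test logs
  if: always()                     # keep logs even when the job fails
  uses: actions/upload-artifact@v4
  with:
    name: atom-test-logs-${{ matrix.model }}   # hypothetical artifact name/key
    path: |
      atom_test_output.txt
      atom_accuracy_output.txt
      accuracy_test_results/*.json
```

`if: always()` matters here, since the logs are most useful precisely when the accuracy check fails the job.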
The accuracy test has exclusive GPU access; the previous 0.3 limit was too conservative and caused server timeouts under concurrent lm_eval requests.
Summary
- Add `accuracy_test_threshold` to matrix config for each model (DeepSeek-R1-0528: 0.94, gpt-oss-120b: 0.38), consistent with ATOM repo
- Validate the accuracy metric (`exact_match,flexible-extract`) against the configured threshold

Test plan
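The per-model thresholds above would sit in the job matrix roughly like this. A sketch only: the `model` key name and `include` layout are assumptions, and only `accuracy_test_threshold` and the two threshold values come from the PR:

```yaml
strategy:
  matrix:
    include:
      - model: DeepSeek-R1-0528        # assumed matrix key name
        accuracy_test_threshold: 0.94
      - model: gpt-oss-120b
        accuracy_test_threshold: 0.38
```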