refactor(ci): reorder and rename GPU workflow steps #548
Open
yuanchen8911 wants to merge 1 commit into NVIDIA:main from
Summary
Standardize GPU workflow step ordering and naming across all three H100 workflows to use a consistent canonical order: resource verification → Karpenter setup → health checks → conformance validation → (intent-specific tests) → artifact upload.
Motivation / Context
The GPU workflow step ordering evolved organically, resulting in inconsistent ordering across workflows (training had chainsaw→validate, conformance/inference had validate→chainsaw), misleading step names, and the resource existence check running last instead of first. This PR standardizes all three workflows to the same logical progression.
Related: #541
Type of Change
Component(s) Affected
.github/workflows/gpu-h100-{conformance,training,inference}-test.yaml
Implementation Notes
Canonical step order (consistent across all 3 workflows):
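The canonical order stated in the Summary can be expressed as a small ordering check. This is a minimal sketch, not part of the PR: the phase names are taken from the Summary, while `order_violations` and the sample phase list are hypothetical and are not copied from the workflow files.

```python
# Canonical phase order, as stated in the PR summary.
CANONICAL_ORDER = [
    "resource verification",
    "Karpenter setup",
    "health checks",
    "conformance validation",
    "intent-specific tests",
    "artifact upload",
]


def order_violations(phases):
    """Return adjacent (earlier, later) pairs that break the canonical order.

    `phases` lists the phases of one workflow in execution order.
    Optional phases (e.g. intent-specific tests) may simply be omitted.
    """
    rank = {name: i for i, name in enumerate(CANONICAL_ORDER)}
    unknown = [p for p in phases if p not in rank]
    if unknown:
        raise ValueError(f"unknown phases: {unknown}")
    violations = []
    for prev, cur in zip(phases, phases[1:]):
        # A phase must come strictly later in the canonical order than its predecessor.
        if rank[prev] >= rank[cur]:
            violations.append((prev, cur))
    return violations


# Hypothetical example: the resource check running last instead of first.
resource_check_last = [
    "Karpenter setup",
    "health checks",
    "conformance validation",
    "resource verification",
]
print(order_violations(CANONICAL_ORDER))      # []
print(order_violations(resource_check_last))  # [('conformance validation', 'resource verification')]
```

A check along these lines could be run against phase lists derived from each workflow, though mapping real step names to phases is left out here.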
Inference ordering constraint: Conformance validation must run before Dynamo deployment because the dra-support check allocates a GPU via DRA ResourceClaim, and the Dynamo vLLM worker also consumes a GPU claim. On H100 x1 (single GPU), running Dynamo first would cause dra-support to fail with "cannot allocate all claims." This was verified when the initial version of this PR caused a dra-support failure on inference.
Step renames:
- Collect AI conformance evidence → Check expected resources exist: it checks resource existence, not conformance
- Validate cluster → Validate CNCF AI Conformance: clarifies this is the behavioral conformance validation
- Upload conformance evidence → Collect and upload validation artifacts: accurate description of what's uploaded
- Load versions → Prepare chainsaw: clarifies this is chainsaw setup plumbing
- Install Karpenter + KWOK (setup) → Install Karpenter + KWOK: removed redundant "(setup)"
Behavioral change: This PR intentionally changes the step order for conformance and inference workflows. Previously, conformance/inference ran aicr validate before chainsaw, so conformance signal was preserved even if chainsaw flaked on monitoring assertions (e.g., grafana availability). In the new order, chainsaw runs before validation in all workflows, which means a chainsaw flake now gates conformance validation. This tradeoff is intentional: it provides a consistent, logical ordering where health checks precede behavioral tests, and gives the monitoring stack more settle time before conformance checks that depend on it.
Testing
YAML-only workflow changes. Validated by:
dra-support regression caught and fixed by moving conformance before Dynamo
Risk Assessment
Risks:
Rollout notes: N/A — CI workflow changes only.
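The gating effect described under "Behavioral change" in the Implementation Notes can be illustrated with a toy model of sequential step execution. This is a sketch under stated assumptions: it models the default GitHub Actions behavior where steps after a failed step are skipped, and it assumes neither step opts out via `if: always()` or `continue-on-error`. The step labels and `run_steps` are illustrative, not the real step names or any real API.

```python
def run_steps(steps, failing=()):
    """Run steps in order; once one fails, every later step is skipped.

    Returns a dict mapping each step to "passed", "failed", or "skipped".
    """
    results = {}
    failed = False
    for step in steps:
        if failed:
            results[step] = "skipped"
        elif step in failing:
            results[step] = "failed"
            failed = True
        else:
            results[step] = "passed"
    return results


# Illustrative labels for the two steps whose relative order changed.
OLD_ORDER = ["aicr validate", "chainsaw health checks"]
NEW_ORDER = ["chainsaw health checks", "aicr validate"]

# Simulate a chainsaw flake (e.g. a monitoring assertion timing out).
flake = {"chainsaw health checks"}
old = run_steps(OLD_ORDER, flake)
new = run_steps(NEW_ORDER, flake)

print(old["aicr validate"])  # passed
print(new["aicr validate"])  # skipped
```

Under the old order the conformance result survives the flake; under the new order it is never produced, which is the tradeoff this PR accepts.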
Checklist
make test with -race
make lint
git commit -S