refactor(ci): reorder and rename GPU workflow steps by yuanchen8911 · Pull Request #548 · NVIDIA/aicr

yuanchen8911 · 2026-04-12T04:03:36Z

Summary

Standardize GPU workflow step ordering and naming across all three H100 workflows to use a consistent canonical order: resource verification → Karpenter setup → health checks → conformance validation → (intent-specific tests) → artifact upload.

Motivation / Context

The GPU workflow steps evolved organically, which left three problems: inconsistent ordering across workflows (training ran chainsaw→validate, while conformance and inference ran validate→chainsaw), misleading step names, and a resource existence check that ran last instead of first. This PR standardizes all three workflows on the same logical progression.

Related: #541

Type of Change

Build/CI/tooling

Component(s) Affected

Other: .github/workflows/gpu-h100-{conformance,training,inference}-test.yaml

Implementation Notes

Canonical step order (consistent across all 3 workflows):

#	Step	Purpose
1	Snapshot and validate GPU	Capture cluster state, verify GPU detection
2	Check expected resources exist	Fast inventory pre-check (~10s), fail early if bundle is incomplete
3	Install Karpenter + KWOK	Setup for cluster-autoscaling check; also provides ~9 min settle time for monitoring stack
4	Prepare + Install + Run chainsaw health checks	Deployment health/readiness assertions
5	Validate CNCF AI Conformance	Behavioral conformance checks (needs bootstrapped metrics pipeline and free GPU for dra-support)
6	(Inference only) Deploy + Validate Dynamo inference	Intent-specific smoke test (consumes GPU via DRA ResourceClaim)
7	Collect and upload validation artifacts	Package validation-result.yaml + conformance-evidence/
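The canonical order above can be expressed as a small check that flags steps appearing after a step from a later phase. This is a minimal sketch and not part of the PR: the helper names are hypothetical, and the expanded chainsaw and Dynamo step names in the example ("Install chainsaw", "Deploy Dynamo inference", and so on) are assumed from the combined table rows rather than copied from the workflow files.

```python
# Phase markers taken from the canonical order table. Matching is by
# case-insensitive substring because rows 4 and 6 each cover several steps.
CANONICAL_ORDER = [
    "Snapshot and validate GPU",
    "Check expected resources exist",
    "Install Karpenter + KWOK",
    "chainsaw",
    "Validate CNCF AI Conformance",
    "Dynamo",  # inference workflow only
    "Collect and upload validation artifacts",
]


def canonical_rank(step_name):
    """Return the index of the phase a step belongs to, or None if unrelated."""
    lowered = step_name.lower()
    for rank, marker in enumerate(CANONICAL_ORDER):
        if marker.lower() in lowered:
            return rank
    return None


def out_of_order_steps(step_names):
    """Return the steps that run after a step from a later canonical phase."""
    violations = []
    highest = -1
    for name in step_names:
        rank = canonical_rank(name)
        if rank is None:
            continue  # unrelated step such as checkout or tool setup
        if rank < highest:
            violations.append(name)
        else:
            highest = rank
    return violations


# Assumed step names for the reordered inference workflow.
new_inference = [
    "Snapshot and validate GPU",
    "Check expected resources exist",
    "Install Karpenter + KWOK",
    "Prepare chainsaw",
    "Install chainsaw",
    "Run chainsaw health checks",
    "Validate CNCF AI Conformance",
    "Deploy Dynamo inference",
    "Validate Dynamo inference",
    "Collect and upload validation artifacts",
]
print(out_of_order_steps(new_inference))  # prints []
```

A list with validation ahead of chainsaw and the resource check near the end would report both of those late steps as violations.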

Inference ordering constraint: Conformance validation must run before Dynamo deployment because the dra-support check allocates a GPU via a DRA ResourceClaim, and the Dynamo vLLM worker also consumes a GPU claim. On H100 x1 (single GPU), running Dynamo first causes dra-support to fail with "cannot allocate all claims." The initial version of this PR confirmed this by triggering a dra-support failure on inference.
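The constraint can be illustrated with a toy model of GPU claims on a single-GPU node. This is a sketch under stated assumptions, not the Kubernetes DRA API: it assumes the dra-support check releases its claim when it finishes and that the Dynamo worker keeps its claim for the rest of the job. The function and constant names are hypothetical.

```python
def first_unschedulable(gpu_count, steps):
    """Toy model of sequential GPU claims.

    Each step is (name, gpus_needed, releases_when_done). Returns the name
    of the first step that cannot get its GPUs, or None if all steps fit.
    """
    free = gpu_count
    for name, needed, releases_when_done in steps:
        if needed > free:
            return name
        free -= needed
        if releases_when_done:
            free += needed
    return None


# Assumption: the check frees its GPU once it completes.
DRA_SUPPORT = ("dra-support check", 1, True)
# Assumption: the deployed worker holds its GPU until teardown.
DYNAMO = ("Dynamo vLLM worker", 1, False)

# Conformance first: both steps fit on one GPU.
print(first_unschedulable(1, [DRA_SUPPORT, DYNAMO]))  # prints None
# Dynamo first: the check finds no free GPU.
print(first_unschedulable(1, [DYNAMO, DRA_SUPPORT]))  # prints dra-support check
```

In this model the ordering only matters when there are fewer GPUs than concurrent claims, which matches the single-GPU case described above.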

Step renames:

Collect AI conformance evidence → Check expected resources exist — it checks resource existence, not conformance
Validate cluster → Validate CNCF AI Conformance — clarifies this is the behavioral conformance validation
Upload conformance evidence → Collect and upload validation artifacts — accurate description of what's uploaded
Load versions → Prepare chainsaw — clarifies this is chainsaw setup plumbing
Install Karpenter + KWOK (setup) → Install Karpenter + KWOK — removed redundant "(setup)"
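For scripts or dashboards that match on step names, the renames above can be captured as a lookup table. This is a minimal sketch: the mapping is copied from the list above, and the helper name is hypothetical.

```python
# Old step name -> new step name, as listed in the renames above.
STEP_RENAMES = {
    "Collect AI conformance evidence": "Check expected resources exist",
    "Validate cluster": "Validate CNCF AI Conformance",
    "Upload conformance evidence": "Collect and upload validation artifacts",
    "Load versions": "Prepare chainsaw",
    "Install Karpenter + KWOK (setup)": "Install Karpenter + KWOK",
}


def current_step_name(name):
    """Translate an old step name to its new name; pass others through."""
    return STEP_RENAMES.get(name, name)


print(current_step_name("Validate cluster"))  # prints Validate CNCF AI Conformance
```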

Behavioral change: This PR intentionally changes the step order for the conformance and inference workflows. Previously, those workflows ran aicr validate before chainsaw, so the conformance signal was preserved even if chainsaw flaked on monitoring assertions (e.g., grafana availability). In the new order, chainsaw runs before validation in all workflows, so a chainsaw flake now gates conformance validation. The tradeoff is intentional: health checks consistently precede behavioral tests, and the monitoring stack gets more settle time before the conformance checks that depend on it.
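The gating effect can be sketched with a simple model of sequential steps. It assumes the default GitHub Actions behavior where a failed step causes later steps to be skipped, and that neither step sets `continue-on-error` or an `if: always()` condition (the workflow files may differ). The step labels and function name are illustrative only.

```python
def steps_that_run(order, failing):
    """Return the steps that execute when the job stops at the first failure."""
    ran = []
    for step in order:
        ran.append(step)
        if step in failing:
            break  # later steps are skipped under the default behavior
    return ran


old_order = ["aicr validate", "chainsaw"]
new_order = ["chainsaw", "aicr validate"]
flake = {"chainsaw"}

# Old order: validation still runs, so the conformance signal is kept.
print(steps_that_run(old_order, flake))  # prints ['aicr validate', 'chainsaw']
# New order: the chainsaw flake stops the job before validation.
print(steps_that_run(new_order, flake))  # prints ['chainsaw']
```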

Testing

YAML-only workflow changes. Validated by:

yamllint on all 3 workflow files
Manual review of step dependencies and DRA resource constraints
Inference dra-support regression caught and fixed by moving conformance before Dynamo

Risk Assessment

Low — Easy to revert

Risks:

Chainsaw health check flakes (e.g., grafana availability on slow H100 x2 runners) will now block conformance validation in workflows where they previously didn't. Mitigation: the additional ~9 min from Karpenter install before chainsaw provides more settle time.

Rollout notes: N/A — CI workflow changes only.

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality
I updated docs if user-facing behavior changed
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

yuanchen8911 added the area/ci label Apr 12, 2026

yuanchen8911 requested a review from a team as a code owner April 12, 2026 04:03

yuanchen8911 added area/tests area/ci labels Apr 12, 2026

github-actions bot added size/L and removed area/tests labels Apr 12, 2026

yuanchen8911 force-pushed the refactor/gpu-workflow-step-ordering branch from e8ccd16 to 2d0062c Compare April 12, 2026 04:11

github-actions bot added size/M and removed size/L labels Apr 12, 2026

yuanchen8911 force-pushed the refactor/gpu-workflow-step-ordering branch 3 times, most recently from 4ded55d to dc303a8 Compare April 12, 2026 04:38

refactor(ci): reorder and rename GPU workflow steps for clarity

91f88c9

yuanchen8911 force-pushed the refactor/gpu-workflow-step-ordering branch from dc303a8 to 91f88c9 Compare April 12, 2026 05:03

yuanchen8911 requested review from mchmarny and xdu31 April 12, 2026 07:11

yuanchen8911 wants to merge 1 commit into NVIDIA:main from yuanchen8911:refactor/gpu-workflow-step-ordering
