
refactor(ci): reorder and rename GPU workflow steps #548

Open
yuanchen8911 wants to merge 1 commit into NVIDIA:main from yuanchen8911:refactor/gpu-workflow-step-ordering


yuanchen8911 commented Apr 12, 2026

Summary

Standardize GPU workflow step ordering and naming across all three H100 workflows to use a consistent canonical order: resource verification → Karpenter setup → health checks → conformance validation → (intent-specific tests) → artifact upload.

Motivation / Context

The GPU workflow step ordering evolved organically, resulting in inconsistent ordering across workflows (training had chainsaw→validate, conformance/inference had validate→chainsaw), misleading step names, and the resource existence check running last instead of first. This PR standardizes all three workflows to the same logical progression.

Related: #541

Type of Change

  • Build/CI/tooling

Component(s) Affected

  • Other: .github/workflows/gpu-h100-{conformance,training,inference}-test.yaml

Implementation Notes

Canonical step order (consistent across all 3 workflows):

| # | Step | Purpose |
|---|------|---------|
| 1 | Snapshot and validate GPU | Capture cluster state, verify GPU detection |
| 2 | Check expected resources exist | Fast inventory pre-check (~10s), fail early if bundle is incomplete |
| 3 | Install Karpenter + KWOK | Setup for cluster-autoscaling check; also provides ~9 min settle time for monitoring stack |
| 4 | Prepare + Install + Run chainsaw health checks | Deployment health/readiness assertions |
| 5 | Validate CNCF AI Conformance | Behavioral conformance checks (needs bootstrapped metrics pipeline and free GPU for dra-support) |
| 6 | (Inference only) Deploy + Validate Dynamo inference | Intent-specific smoke test (consumes GPU via DRA ResourceClaim) |
| 7 | Collect and upload validation artifacts | Package validation-result.yaml + conformance-evidence/ |
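As a rough illustration of what "canonical order" means here, the table can be turned into a small ordering check. This is a minimal sketch, not part of the PR: the short phase keys and the `order_violations` helper are hypothetical labels introduced for the example and do not appear in the workflow files.

```python
# Canonical phase order taken from the table above. The short phase keys
# are hypothetical labels for this sketch, not names used in the workflows.
CANONICAL_PHASES = [
    "snapshot",        # 1 Snapshot and validate GPU
    "resource-check",  # 2 Check expected resources exist
    "karpenter",       # 3 Install Karpenter + KWOK
    "chainsaw",        # 4 Prepare + Install + Run chainsaw health checks
    "conformance",     # 5 Validate CNCF AI Conformance
    "dynamo",          # 6 (Inference only) Deploy + Validate Dynamo inference
    "artifacts",       # 7 Collect and upload validation artifacts
]


def order_violations(phases):
    """Return adjacent (earlier, later) pairs that break the canonical order."""
    rank = {name: i for i, name in enumerate(CANONICAL_PHASES)}
    unknown = [p for p in phases if p not in rank]
    if unknown:
        raise ValueError(f"unknown phases: {unknown}")
    return [(a, b) for a, b in zip(phases, phases[1:]) if rank[a] > rank[b]]
```

A workflow that skips the inference-only phase still passes, since only the relative order of the phases present is checked. The pre-PR conformance ordering (validate before chainsaw) would be reported as `[("conformance", "chainsaw")]`.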

Inference ordering constraint: Conformance validation must run before Dynamo deployment because the dra-support check allocates a GPU via DRA ResourceClaim, and the Dynamo vLLM worker also consumes a GPU claim. On H100 x1 (single GPU), running Dynamo first would cause dra-support to fail with "cannot allocate all claims." This was verified when the initial version of this PR caused a dra-support failure on inference.
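The contention can be pictured with a toy model. The sketch below is only an illustration under stated assumptions: `GpuPool` is a hypothetical stand-in for DRA allocation, and it assumes the dra-support check releases its claim when it finishes while the Dynamo vLLM worker holds its claim for as long as it runs.

```python
class GpuPool:
    """Toy stand-in for DRA allocation on a node with a fixed GPU count."""

    def __init__(self, total):
        self.free = total

    def allocate(self, owner):
        if self.free == 0:
            raise RuntimeError(f"{owner}: cannot allocate all claims")
        self.free -= 1

    def release(self):
        self.free += 1


def dra_support_check(pool):
    # Assumption: the check claims one GPU and releases it when done.
    pool.allocate("dra-support")
    pool.release()


def deploy_dynamo(pool):
    # Assumption: the vLLM worker keeps its GPU claim while it is running.
    pool.allocate("dynamo-vllm-worker")


def run(steps, gpus=1):
    """Run the steps in order against a fresh pool; return the free GPU count."""
    pool = GpuPool(gpus)
    for step in steps:
        step(pool)
    return pool.free
```

With `gpus=1`, `run([dra_support_check, deploy_dynamo])` completes, while `run([deploy_dynamo, dra_support_check])` raises the allocation error. With `gpus=2`, either order completes in this model.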

Step renames:

  • Collect AI conformance evidence → Check expected resources exist: it checks resource existence, not conformance
  • Validate cluster → Validate CNCF AI Conformance: clarifies this is the behavioral conformance validation
  • Upload conformance evidence → Collect and upload validation artifacts: accurate description of what's uploaded
  • Load versions → Prepare chainsaw: clarifies this is chainsaw setup plumbing
  • Install Karpenter + KWOK (setup) → Install Karpenter + KWOK: removed redundant "(setup)"
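For reviewers who want to confirm that no old step name survived in a workflow, the renames above can be kept as a lookup table. This is a minimal sketch; `STEP_RENAMES` and `stale_names` are hypothetical names for the example, and the caller is assumed to supply the list of step names from a workflow file.

```python
# Old step name -> new step name, copied from the rename list above.
STEP_RENAMES = {
    "Collect AI conformance evidence": "Check expected resources exist",
    "Validate cluster": "Validate CNCF AI Conformance",
    "Upload conformance evidence": "Collect and upload validation artifacts",
    "Load versions": "Prepare chainsaw",
    "Install Karpenter + KWOK (setup)": "Install Karpenter + KWOK",
}


def stale_names(step_names):
    """Return any pre-rename step names still present in a workflow."""
    return [name for name in step_names if name in STEP_RENAMES]
```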

Behavioral change: This PR intentionally changes the step order for conformance and inference workflows. Previously, conformance/inference ran aicr validate before chainsaw, so conformance signal was preserved even if chainsaw flaked on monitoring assertions (e.g., grafana availability). In the new order, chainsaw runs before validation in all workflows, which means a chainsaw flake now gates conformance validation. This tradeoff is intentional — it provides a consistent, logical ordering where health checks precede behavioral tests, and gives the monitoring stack more settle time before conformance checks that depend on it.
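The gating effect can be sketched as follows. The model assumes plain fail-fast step execution, where a failed step causes later steps to be skipped; the `executed` helper and the two-step orders are simplifications for illustration, not the actual workflow definitions.

```python
def executed(order, flaky):
    """Return the steps that run when each step gates the next (fail-fast)."""
    ran = []
    for step in order:
        ran.append(step)
        if step in flaky:
            break  # later steps are skipped after a failure
    return ran


# Simplified before/after orders for the conformance and inference workflows.
old_order = ["conformance", "chainsaw"]
new_order = ["chainsaw", "conformance"]
```

With a chainsaw flake, `executed(old_order, {"chainsaw"})` still includes `"conformance"`, whereas `executed(new_order, {"chainsaw"})` stops at `"chainsaw"`.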

Testing

YAML-only workflow changes. Validated by:

  • yamllint on all 3 workflow files
  • Manual review of step dependencies and DRA resource constraints
  • Inference dra-support regression caught and fixed by moving conformance before Dynamo
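The yamllint pass could be scripted along these lines. This is a sketch under assumptions: it expects `yamllint` to be installed and on `PATH` and to be run from the repository root, and the helper names are hypothetical. The file paths are expanded from the pattern listed under Component(s) Affected.

```python
import subprocess

WORKFLOW_DIR = ".github/workflows"
INTENTS = ["conformance", "training", "inference"]


def workflow_paths():
    """Expand gpu-h100-{conformance,training,inference}-test.yaml into paths."""
    return [f"{WORKFLOW_DIR}/gpu-h100-{intent}-test.yaml" for intent in INTENTS]


def lint_workflows():
    """Run yamllint on the three files and return its exit code (0 means clean)."""
    return subprocess.run(["yamllint", *workflow_paths()]).returncode
```

Calling `lint_workflows()` from the repository root lints all three files in one invocation.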

Risk Assessment

  • Low — Easy to revert

Risks:

  • Chainsaw health check flakes (e.g., grafana availability on slow H100 x2 runners) will now block conformance validation in workflows where they previously didn't. Mitigation: the additional ~9 min from Karpenter install before chainsaw provides more settle time.

Rollout notes: N/A — CI workflow changes only.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

yuanchen8911 requested a review from a team as a code owner April 12, 2026 04:03
yuanchen8911 force-pushed the refactor/gpu-workflow-step-ordering branch from e8ccd16 to 2d0062c April 12, 2026 04:11
github-actions bot added size/M and removed size/L labels Apr 12, 2026
yuanchen8911 force-pushed the refactor/gpu-workflow-step-ordering branch 3 times, most recently from 4ded55d to dc303a8 April 12, 2026 04:38
yuanchen8911 force-pushed the refactor/gpu-workflow-step-ordering branch from dc303a8 to 91f88c9 April 12, 2026 05:03
yuanchen8911 requested review from mchmarny and xdu31 April 12, 2026 07:11