Commit b10ad3c

Merge branch 'main' into h200-aks
Signed-off-by: Mark Chmarny <mchmarny@users.noreply.github.com>
2 parents 78be37f + 31cc57e commit b10ad3c

138 files changed

Lines changed: 5538 additions & 776 deletions

.claude/CLAUDE.md

Lines changed: 51 additions & 0 deletions
@@ -247,6 +247,27 @@ slog.Error("operation failed", "error", err, "component", "gpu-collector")
 
 **Note:** A component must have either `helm` OR `kustomize` configuration, not both.
 
+**Using mixins for shared OS/platform content:**
+```yaml
+# Leaf overlay referencing mixins instead of duplicating content
+spec:
+  base: h100-eks-ubuntu-training
+  mixins:
+    - os-ubuntu # Ubuntu constraints (defined once in recipes/mixins/)
+    - platform-kubeflow # kubeflow-trainer component (defined once in recipes/mixins/)
+  criteria:
+    service: eks
+    accelerator: h100
+    os: ubuntu
+    intent: training
+    platform: kubeflow
+  constraints:
+    - name: K8s.server.version
+      value: ">= 1.32.4"
+```
+
+Mixins carry only `constraints` and `componentRefs` — no `criteria`, `base`, `mixins`, or `validation`. They live in `recipes/mixins/` with `kind: RecipeMixin`.
+
 ## Error Wrapping Rules
 
 **Never return bare errors.** Every `return err` must wrap with context:
@@ -457,6 +478,35 @@ ${AICR_BIN} validate -r recipe.yaml -s snapshot.yaml --no-cluster
 | Ignore `Close()` error on writable file handles | Capture and check `closeErr := f.Close()` |
 | Hardcode resource names from templates | Extract to named constants to keep code and templates in sync |
 
+## Pull Request Requirements
+
+**Pre-push checklist:** Always run `make qualify` before pushing. This is the CI-equivalent gate that covers tests, linting (golangci-lint + yamllint), e2e, vulnerability scan, and repo-specific checks (docs sidebar, agents sync). Do not substitute a subset of commands — if `make qualify` passes locally, CI will pass.
+
+**Branch hygiene:**
+- Always rebase onto the target branch before pushing: `git fetch origin main && git rebase origin/main`
+- Squash commits into a single commit before push
+- Cryptographically sign commits (`git commit -S`)
+
+**PR description:** Use the template from `.github/PULL_REQUEST_TEMPLATE.md` exactly as defined there. Do not inline a modified copy — read and fill in the canonical template. The template covers: Summary, Motivation/Context (with Fixes/Related), Type of Change, Components Affected, Implementation Notes, Testing, Risk Assessment, and Checklist.
+
+**Test coverage gate (Go packages only):**
+Before pushing a PR that changes Go source files, check test coverage on affected packages. Set `pkg` to the narrowest directory root you want to measure — `$pkg/...` intentionally includes descendant packages. Prefer the narrowest changed root (e.g., if only `pkg/collector/topology` changed, use `pkg=pkg/collector/topology`, not `pkg=pkg/collector`). Use a broader root only when you intentionally want one combined delta across related subpackages.
+1. Run `GOFLAGS="-mod=vendor" go test -coverprofile=cover.out ./$pkg/...` on each changed package
+2. Get the baseline using a clean worktree (changes must be committed first): `(git worktree add $TMPDIR/baseline origin/main && (cd $TMPDIR/baseline && GOFLAGS="-mod=vendor" go test -coverprofile=$TMPDIR/base.out ./$pkg/...); rc=$?; git worktree remove --force $TMPDIR/baseline; return $rc 2>/dev/null || (exit $rc))`. This preserves the test exit status through cleanup. Write the profile to `$TMPDIR/base.out` (outside the worktree) so it survives cleanup. Compare with `go tool cover -func` on both profiles. Skip this step for entirely new packages.
+3. **Block** if `make test-coverage` fails — this enforces the project-wide 70% floor (from `.settings.yaml`). Do not use per-package profiles for this check.
+4. **Flag** any package with per-package coverage decrease > 0.5% (comparing step 1 vs step 2)
+5. **Block** if any new exported function or method (identified via `git diff origin/main -- $pkg/` — look for added `func` lines with uppercase names) has 0% coverage — add tests before pushing
+6. Report the delta in the PR description's Testing section (e.g., `pkg/recipe: 90.4% → 90.3% (-0.1%)`)
+This rule does not apply to non-Go changes (YAML, docs, CI workflows). Note: CI also posts per-package coverage deltas post-push via `go-coverage-report` in `on-push-comment.yaml`; this gate catches regressions before push.
+
+**PR policy:**
+- Do NOT add `Co-Authored-By` lines (organization policy)
+- Do NOT add "Generated with Claude Code", "Created by Codex", or similar attribution
+- Add appropriate type labels: `enhancement`, `bug`, `documentation`
+- Area labels are auto-assigned by `.github/labeler.yml` based on changed file paths (e.g., `area/recipes`, `area/ci`, `area/api`, `area/cli`, `area/bundler`, `area/collector`, `area/validator`, `area/docs`, `area/infra`, `area/tests`). You may also add them manually when the auto-labeler wouldn't match (e.g., issue-only PRs or cross-cutting changes).
+- Do NOT add `size/*` labels (auto-assigned by bot)
+- Keep the PR title under 70 characters; use the description for details
+
 ## Key Files
 
 | File | Purpose |
@@ -467,6 +517,7 @@ ${AICR_BIN} validate -r recipe.yaml -s snapshot.yaml --no-cluster
 | `.settings.yaml` | Project settings: tool versions, quality thresholds, build/test config (single source of truth) |
 | `recipes/registry.yaml` | Declarative component configuration |
 | `recipes/overlays/*.yaml` | Recipe overlay definitions |
+| `recipes/mixins/*.yaml` | Composable mixin fragments (OS constraints, platform components) |
 | `recipes/components/*/values.yaml` | Component Helm values |
 | `api/aicr/v1/server.yaml` | OpenAPI spec |
 | `.goreleaser.yaml` | Release configuration |
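
Note on step 2 of the coverage gate in `.claude/CLAUDE.md`: the baseline command captures the test exit status, runs the worktree cleanup unconditionally, and then re-raises the captured status so that cleanup cannot mask a failing test run. The sketch below isolates that idiom under simplifying assumptions. `with_scratch_dir` is a hypothetical helper name, and a `mktemp -d` directory stands in for the `git worktree`; it is an illustration, not code from this commit.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of the repository): run a command inside a
# throwaway directory, always remove the directory afterwards, and return the
# command's own exit status rather than the status of the cleanup.
with_scratch_dir() {
  local scratch rc
  scratch="$(mktemp -d)" || return 1

  # Run in a subshell so the caller's working directory is untouched.
  ( cd "$scratch" && "$@" )
  rc=$?               # capture the status before cleanup overwrites $?

  rm -rf "$scratch"   # cleanup runs whether the command passed or failed
  return "$rc"        # re-raise the original status
}
```

Anything the command needs to keep, such as a coverage profile, has to be written outside the scratch directory, which is why the documented command writes `base.out` to `$TMPDIR` rather than into the worktree.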

.github/actions/aicr-build/action.yml

Lines changed: 48 additions & 9 deletions
@@ -15,6 +15,16 @@
 name: 'AICR Build'
 description: 'Builds the aicr validator image (via Dockerfile) and CLI binary, and loads the image into kind.'
 
+inputs:
+  build_validators:
+    description: 'Deprecated: use validator_phases instead. Ignored when validator_phases is set.'
+    required: false
+    default: 'true'
+  validator_phases:
+    description: 'Comma-separated validator phases to build (e.g., "conformance,deployment"), or "none" to skip all. Takes precedence over build_validators.'
+    required: false
+    default: ''
+
 runs:
   using: 'composite'
   steps:
@@ -27,28 +37,54 @@ runs:
 
     - name: Build snapshot agent image and load into kind
       shell: bash
+      env:
+        GOFLAGS: -mod=vendor
       run: |
-        # Build snapshot agent image with CUDA runtime (provides nvidia-smi for GPU detection).
+        # Build snapshot agent image with CUDA base (provides nvidia-smi for GPU detection).
+        # Uses cuda:base (~250MB) instead of cuda:runtime (~1.8GB) — only nvidia-smi is needed.
         # GPU test workflows use --image=ko.local:smoke-test for aicr snapshot.
         CGO_ENABLED=0 go build -trimpath -o dist/aicr ./cmd/aicr
         docker build -t ko.local:smoke-test -f - . <<'DOCKERFILE'
-        FROM nvcr.io/nvidia/cuda:13.1.0-runtime-ubuntu24.04
+        FROM nvcr.io/nvidia/cuda:13.1.0-base-ubuntu24.04
         COPY dist/aicr /usr/local/bin/aicr
         ENTRYPOINT ["/usr/local/bin/aicr"]
         DOCKERFILE
-        kind load docker-image ko.local:smoke-test --name "${KIND_CLUSTER_NAME}"
+
+        # Load onto all nodes. The snapshot agent requests nvidia.com/gpu but
+        # does not set a node selector, so it can land on any GPU-capable node
+        # including the control-plane (e.g., T4 smoke test).
+        timeout 600 kind load docker-image ko.local:smoke-test --name "${KIND_CLUSTER_NAME}" || {
+          echo "::warning::kind load attempt 1 failed for ko.local:smoke-test, retrying..."
+          timeout 600 kind load docker-image ko.local:smoke-test --name "${KIND_CLUSTER_NAME}"
+        }
 
     - name: Build validator images and load into kind
+      if: "!(inputs.validator_phases == 'none' || (inputs.validator_phases == '' && inputs.build_validators == 'false'))"
       shell: bash
       env:
         GOFLAGS: -mod=vendor
       run: |
-        # Compile validator binaries on host (Go build cache) then COPY-only images.
+        # Determine which validator phases to build.
+        # validator_phases takes precedence; build_validators is a deprecated fallback.
+        if [[ -n "${{ inputs.validator_phases }}" ]]; then
+          if [[ "${{ inputs.validator_phases }}" == "none" ]]; then
+            echo "Skipping validator builds (validator_phases=none)"
+            exit 0
+          fi
+          PHASES="${{ inputs.validator_phases }}"
+        else
+          # Default: build all phases (backwards compatible)
+          PHASES="deployment,performance,conformance"
+        fi
+
+        # Compile only the requested validator binaries.
         mkdir -p dist/validator
-        CGO_ENABLED=0 go build -trimpath -o dist/validator/deployment ./validators/deployment
-        CGO_ENABLED=0 go build -trimpath -o dist/validator/performance ./validators/performance
-        CGO_ENABLED=0 go build -trimpath -o dist/validator/conformance ./validators/conformance
-        for phase in deployment performance conformance; do
+        for phase in ${PHASES//,/ }; do
+          echo "Building validator binary: ${phase}"
+          CGO_ENABLED=0 go build -trimpath -o "dist/validator/${phase}" "./validators/${phase}"
+        done
+
+        for phase in ${PHASES//,/ }; do
           mkdir -p "validators/${phase}/testdata"
           docker build -t "ko.local/aicr-validators/${phase}:latest" -f - . <<DOCKERFILE
         FROM gcr.io/distroless/static-debian12:nonroot
@@ -58,7 +94,10 @@ runs:
         USER nonroot
         ENTRYPOINT ["/${phase}"]
         DOCKERFILE
-          kind load docker-image "ko.local/aicr-validators/${phase}:latest" --name "${KIND_CLUSTER_NAME}"
+          timeout 300 kind load docker-image "ko.local/aicr-validators/${phase}:latest" --name "${KIND_CLUSTER_NAME}" || {
+            echo "::warning::kind load attempt 1 failed for ko.local/aicr-validators/${phase}:latest, retrying..."
+            timeout 300 kind load docker-image "ko.local/aicr-validators/${phase}:latest" --name "${KIND_CLUSTER_NAME}"
+          }
         done
 
     - name: Build aicr binary
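
Note on the `aicr-build` changes: two pieces of logic are spread across the step-level `if:` expression and the `run:` scripts, namely how the `validator_phases` and `build_validators` inputs resolve to a list of phases, and how `kind load` is retried once under a timeout. The sketch below restates both as plain bash functions so the behavior can be checked outside a workflow run. The function names `resolve_phases` and `retry_once_with_timeout` are hypothetical and do not exist in the repository; the precedence rules are taken from the input descriptions in this diff.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the space-separated validator phases to build,
# or print nothing when validator builds should be skipped.
#   $1 = validator_phases input ("" when unset)
#   $2 = build_validators input (deprecated; defaults to "true")
resolve_phases() {
  local validator_phases="$1"
  local build_validators="${2:-true}"

  if [[ -n "$validator_phases" ]]; then
    # validator_phases takes precedence whenever it is set; "none" skips all.
    if [[ "$validator_phases" != "none" ]]; then
      echo "${validator_phases//,/ }"   # "a,b" becomes "a b"
    fi
  elif [[ "$build_validators" != "false" ]]; then
    # Neither input opts out: build every phase (backwards compatible).
    echo "deployment performance conformance"
  fi
}

# Hypothetical helper: run a command under `timeout` and retry exactly once
# if the first attempt fails or times out.
#   $1 = timeout in seconds, remaining arguments = command to run
retry_once_with_timeout() {
  local seconds="$1"
  shift
  timeout "$seconds" "$@" || {
    echo "::warning::attempt 1 failed for: $*, retrying..."
    timeout "$seconds" "$@"
  }
}
```

With these helpers, a loop such as `for phase in $(resolve_phases "conformance,deployment" "false"); do ...; done` would visit `conformance` and `deployment`, because a non-empty `validator_phases` overrides the deprecated `build_validators` flag. The retry helper returns the status of the second attempt when the first one fails, so a persistent failure still fails the step.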

.github/actions/gpu-snapshot-validate/action.yml

Lines changed: 3 additions & 2 deletions
@@ -45,8 +45,9 @@ runs:
     - name: Validate snapshot detected GPU
       shell: bash
       run: |
-        GPU_MODEL=$(yq eval '.measurements[] | select(.type == "GPU") | .subtypes[0].data["gpu.model"]' snapshot.yaml)
-        GPU_COUNT=$(yq eval '.measurements[] | select(.type == "GPU") | .subtypes[0].data["gpu-count"]' snapshot.yaml)
+        # Query by subtype field (not index) — #502 added a "hardware" subtype before "smi".
+        GPU_MODEL=$(yq eval '.measurements[] | select(.type == "GPU") | .subtypes[] | select(.subtype == "smi") | .data["gpu.model"]' snapshot.yaml)
+        GPU_COUNT=$(yq eval '.measurements[] | select(.type == "GPU") | .subtypes[] | select(.subtype == "smi") | .data["gpu-count"]' snapshot.yaml)
         echo "GPU model: ${GPU_MODEL}"
         echo "GPU count: ${GPU_COUNT}"
         if [[ "${GPU_MODEL}" != *"${{ inputs.gpu_model }}"* ]]; then

.github/workflows/build-attested.yaml

Lines changed: 1 addition & 1 deletion
@@ -92,7 +92,7 @@ jobs:
           done
 
       - name: Upload archives
-        uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v7.0.0
+        uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
         with:
           name: aicr-attested-binaries
           path: dist/*.tar.gz

.github/workflows/conflict-check.yaml

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ jobs:
     timeout-minutes: 10
     steps:
       - name: Check Mergeable State
-        uses: actions/github-script@ed597411d8f924073f98dfc5c65a23a2325f34cd # v8.0.0
+        uses: actions/github-script@3a2844b7e9c422d3c10d287c895573f7108da1b3 # v9.0.0
         with:
           script: |
             const label = 'needs-rebase';
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Validates Fern docs configuration on pull requests that touch docs or fern/.
+
+name: Fern Docs CI
+
+on:
+  pull_request:
+    paths:
+      - 'docs/**'
+      - 'fern/**'
+      - '.github/workflows/fern-docs-ci.yaml'
+  workflow_dispatch: {}
+
+permissions:
+  contents: read
+
+jobs:
+  fern-check:
+    name: Fern Check
+    runs-on: ubuntu-latest
+    timeout-minutes: 10
+    steps:
+      - name: Checkout
+        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+
+      - name: Setup Node.js
+        uses: actions/setup-node@53b83947a5a98c8d113130e565377fae1a50d02f # v6.3.0
+        with:
+          node-version: '20'
+
+      - name: Install Fern CLI
+        run: npm install -g fern-api@$(jq -r .version fern/fern.config.json)
+
+      - name: Fern check
+        run: fern check
+
+      - name: Check links
+        uses: lycheeverse/lychee-action@8646ba30535128ac92d33dfc9133794bfdd9b411 # v2.8.0
+        with:
+          args: --offline --no-progress 'docs/**/*.md'
+          fail: true
Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Workflow 1 of 2 for Fern doc previews.
+#
+# Collects the fern/ sources and PR metadata from the (possibly untrusted) PR
+# branch and uploads them as an artifact. No secrets are used here, so this is
+# safe to run on fork PRs via the regular pull_request trigger.
+#
+# The companion workflow (fern-docs-preview-comment.yml) picks up the artifact,
+# builds the preview with DOCS_FERN_TOKEN, and posts the PR comment.
+
+name: "Preview Fern Docs: Build"
+
+on:
+  pull_request:
+    paths:
+      - 'docs/**'
+      - 'fern/**'
+      - '.github/workflows/fern-docs-preview-build.yml'
+
+permissions:
+  contents: read
+
+jobs:
+  collect:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout PR
+        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+        with:
+          fetch-depth: 0
+
+      - name: Save PR metadata
+        env:
+          PR_NUMBER: ${{ github.event.pull_request.number }}
+          HEAD_REF: ${{ github.head_ref }}
+          BASE_REF: ${{ github.base_ref }}
+        run: |
+          mkdir -p preview-metadata
+          echo "$PR_NUMBER" > preview-metadata/pr_number
+          echo "$HEAD_REF" > preview-metadata/head_ref
+          git diff --name-only "origin/${BASE_REF}...HEAD" -- '*.md' > preview-metadata/changed_md_files 2>/dev/null || true
+
+      - name: Upload fern sources and metadata
+        uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
+        with:
+          name: fern-preview
+          path: |
+            fern/
+            docs/
+            preview-metadata/
+          retention-days: 1
