|
3 | 3 |
|
4 | 4 | # Standalone model-streamer CI lane. |
5 | 5 | # |
6 | | -# Status — STAGE 1 (S3-upload only): the Docker daemon is not yet |
7 | | -# provisioned on the GPU runner (prod-modelexpress-tester-amd-gpu-v1), so |
8 | | -# `docker build` / `docker run` would fail. While ops works on enabling |
9 | | -# DinD / socket-mount + GPU passthrough, this workflow runs only the parts |
10 | | -# that don't need Docker: |
| 6 | +# Status — STAGE 2 (full end-to-end): Docker is now provisioned on the GPU |
| 7 | +# runner (DinD) and NGC_API_KEY is available, so we can build the worker |
| 8 | +# image locally on the runner and exercise the full streamer-load + inference |
| 9 | +# flow without round-tripping through NGC. End-to-end this workflow does: |
11 | 10 | # 1. Checkout |
12 | | -# 2. Install boto3 + huggingface-cli |
13 | | -# 3. Download safetensors from HuggingFace (if not already cached in S3) |
14 | | -# and upload them to s3://${MX_CI_S3_BUCKET}/models/${MX_CI_MODEL}/ |
| 11 | +# 2. Stage safetensors in S3 (idempotent — skipped if already cached) |
| 12 | +# 3. Build the vLLM worker image locally on the runner (no push) |
| 13 | +# 4. Run a single vLLM container with --load-format mx and |
| 14 | +# RUNAI_STREAMER_CONCURRENCY set; weights stream from S3 via IRSA |
| 15 | +# 5. Wait for "Model streamer weight loading complete" in the container |
| 16 | +# logs |
| 17 | +# 6. Wait for /health on the OpenAI server |
| 18 | +# 7. Send a /v1/completions request; assert non-empty completion text |
| 19 | +# 8. Cleanup that always runs: stop+rm the container, remove the local image
| 20 | +# (cancel-in-progress concurrency means GHA can cancel a run midway;
| 21 | +# the cleanup steps are marked `if: always()` so the runner's Docker daemon
| 22 | +# doesn't accumulate stale containers/images across runs)
15 | 23 | # |
16 | | -# What this validates today: |
17 | | -# - Runner picks up GHA jobs end-to-end |
18 | | -# - Self-hosted runner has internet egress for `pip install` + HF download |
19 | | -# - IRSA on the runner has the right S3 permissions (list / put / head) |
| 24 | +# The steps are inlined from .github/actions/run-mx-streamer-test/action.yml |
| 25 | +# so we can iterate on them in isolation. Once stable here, port any |
| 26 | +# improvements back into that composite action. |
20 | 27 | # |
21 | | -# What it does NOT validate yet (deferred to STAGE 2 once Docker is up): |
22 | | -# - vLLM image build |
23 | | -# - runai-model-streamer reading from S3 inside the container |
24 | | -# - --load-format mx + ModelExpress plugin loading |
25 | | -# - OpenAI inference endpoint comes up |
26 | | -# |
27 | | -# When ops enables Docker on the runner, re-add the steps from the full |
28 | | -# composite action at .github/actions/run-mx-streamer-test/action.yml |
29 | | -# (build, docker run, wait, verify inference, cleanup) — they're already |
30 | | -# written and tested in the main workflow's `model-streamer-vllm` job. |
| 28 | +# Local-build / no-NGC-push pattern: doing both build and run on the GPU |
| 29 | +# runner means the image lives in the runner's Docker daemon only — never |
| 30 | +# pushed to or pulled from any registry. Saves bandwidth + avoids exercising |
| 31 | +# the NGC pull path here (the main workflow's `model-streamer-vllm` job does |
| 32 | +# the registry round-trip; this one focuses on the runner-local mechanics). |
31 | 33 | # |
32 | 34 | # Required secrets: |
33 | | -# HF_TOKEN — HuggingFace token; ignored when empty (Qwen2.5-0.5B is |
34 | | -# public). Required only if the model is gated. |
| 35 | +# NGC_API_KEY — used to `docker login nvcr.io` so the build can pull any
| 36 | +# nvcr.io layers it depends on. The base image,
| 37 | +# `vllm/vllm-openai:v0.17.1`, is on public Docker Hub and
| 38 | +# needs no login. Same secret the main workflow uses.
| 39 | +# HF_TOKEN — HuggingFace token; ignored when empty (Qwen2.5-0.5B is |
| 40 | +# public). Required only if the model is gated. |
35 | 41 | # |
36 | 42 | # Required IRSA on the GPU runner: |
37 | 43 | # IAM role with list / put / head on s3://${MX_CI_S3_BUCKET}/models/. |
38 | 44 |
|
39 | 45 | name: ModelExpress Model Streamer Test |
40 | 46 |
|
41 | 47 | on: |
42 | | - # Triggered indirectly via copy-pr-bot. PR-driven flow: |
43 | | - # 1. Open a PR (the PR itself does NOT fire this workflow — `pull_request` |
44 | | - # events are forbidden on Velonix self-hosted runners because PR-head |
45 | | - # code is untrusted). |
46 | | - # 2. Either: |
47 | | - # a. All commits on the PR are GPG-signed → copy-pr-bot trusts the |
48 | | - # author and auto-creates `pull-request/<N>` with the PR content. |
49 | | - # b. Commits are unsigned or come from an external contributor → |
50 | | - # a maintainer comments `/ok to test <commit_sha>` on the PR. |
51 | | - # 3. The bot pushes the PR content to `pull-request/<N>` in this repo. |
52 | | - # That push (from the trusted bot identity, into an internal branch) |
53 | | - # fires this workflow. |
54 | | - # See the dynamo reference: ai-dynamo/dynamo/.github/workflows/pr.yaml. |
55 | 48 | push: |
56 | 49 | branches: |
57 | 50 | - "pull-request/[0-9]+" |
|
67 | 60 | MX_CI_MODEL: Qwen/Qwen2.5-0.5B |
68 | 61 | MX_CI_S3_BUCKET: ai-dynamo-modelexpress-ci |
69 | 62 | MX_CI_S3_REGION: us-east-1 |
| 63 | + # Docker / runtime knobs (previously composite-action inputs). Tweak here |
| 64 | + # without touching the composite action. |
| 65 | + MX_CI_VLLM_PORT: "18888" |
| 66 | + MX_CI_STREAMER_CONCURRENCY: "16" |
| 67 | + MX_CI_LOAD_TIMEOUT_SECONDS: "300" |
70 | 68 |
|
71 | 69 | jobs: |
72 | | - s3-upload: |
73 | | - name: S3 upload (Stage 1, no Docker) |
| 70 | + model-streamer: |
| 71 | + name: Model Streamer test (vLLM, S3) |
74 | 72 | runs-on: prod-modelexpress-tester-amd-gpu-v1 |
75 | 73 | permissions: |
76 | 74 | contents: read |
77 | 75 |
|
| 76 | + env: |
| 77 | + # Local image tag, lives only in the GPU runner's Docker daemon — |
| 78 | + # never pushed to or pulled from any registry. |
| 79 | + WORKER_IMAGE: mx-worker-vllm:local |
| 80 | + |
78 | 81 | steps: |
79 | 82 | - name: Checkout |
80 | 83 | uses: actions/checkout@v4 |
@@ -162,3 +165,127 @@ jobs: |
162 | 165 | ) |
163 | 166 | print(f"Upload verified: {len(uploaded)} file(s).") |
164 | 167 | EOF |
| 168 | +
|
| 169 | + - name: Log in to NGC for base image pulls |
| 170 | + env: |
| 171 | + NGC_API_KEY: ${{ secrets.NGC_API_KEY }} |
| 172 | + run: | |
| 173 | + echo "${NGC_API_KEY}" | docker login nvcr.io -u '$oauthtoken' --password-stdin |
| 174 | +
|
| 175 | + - name: Build vLLM worker image (local, no registry) |
| 176 | + # Build directly on the GPU runner — image lives in the local Docker |
| 177 | + # daemon and is referenced by `docker run` below via the WORKER_IMAGE |
| 178 | + # env. Build context is repo root (Dockerfile COPYs from |
| 179 | + # modelexpress_client/python/). |
| 180 | + run: | |
| 181 | + docker build \ |
| 182 | + -f ci/k8s/client/vllm/Dockerfile \ |
| 183 | + -t "${WORKER_IMAGE}" \ |
| 184 | + . |
| 185 | +
|
| 186 | + - name: Start vLLM container with model streamer |
| 187 | + env: |
| 188 | + # WORKER_IMAGE from job-level env (local tag built above). |
| 189 | + MODEL: ${{ env.MX_CI_MODEL }} |
| 190 | + S3_BUCKET: ${{ env.MX_CI_S3_BUCKET }} |
| 191 | + S3_REGION: ${{ env.MX_CI_S3_REGION }} |
| 192 | + PORT: ${{ env.MX_CI_VLLM_PORT }} |
| 193 | + STREAMER_CONCURRENCY: ${{ env.MX_CI_STREAMER_CONCURRENCY }} |
| 194 | + run: | |
| 195 | + set -euo pipefail |
| 196 | + MODEL_S3_URI="s3://${S3_BUCKET}/models/${MODEL}" |
| 197 | +
|
| 198 | + # IRSA credentials don't auto-propagate into child containers. The EKS |
| 199 | + # runner pod gets AWS_ROLE_ARN + AWS_WEB_IDENTITY_TOKEN_FILE wired in, |
| 200 | + # but a `docker run` child sees neither unless we forward them: pass |
| 201 | + # the role ARN as -e and mount the OIDC token file in as a volume. |
| 202 | + # boto3 inside the container then does its own sts:AssumeRoleWithWebIdentity. |
| 203 | + IRSA_TOKEN="${AWS_WEB_IDENTITY_TOKEN_FILE:-/var/run/secrets/eks.amazonaws.com/serviceaccount/token}" |
| 204 | +
|
| 205 | + CONTAINER_ID=$(docker run -d --gpus all --ipc=host \ |
| 206 | + -e AWS_ROLE_ARN="${AWS_ROLE_ARN}" \ |
| 207 | + -e AWS_WEB_IDENTITY_TOKEN_FILE="/var/run/secrets/eks-token" \ |
| 208 | + -v "${IRSA_TOKEN}:/var/run/secrets/eks-token:ro" \ |
| 209 | + -e AWS_DEFAULT_REGION="${S3_REGION}" \ |
| 210 | + -e MX_MODEL_URI="${MODEL_S3_URI}" \ |
| 211 | + -e RUNAI_STREAMER_CONCURRENCY="${STREAMER_CONCURRENCY}" \ |
| 212 | + -e VLLM_PLUGINS=modelexpress \ |
| 213 | + -p "${PORT}:${PORT}" \ |
| 214 | + "${WORKER_IMAGE}" \ |
| 215 | + python3 -m vllm.entrypoints.openai.api_server \ |
| 216 | + --model "${MODEL}" \ |
| 217 | + --load-format mx \ |
| 218 | + --port "${PORT}") |
| 219 | + echo "Container: ${CONTAINER_ID}" |
| 220 | + echo "CONTAINER_ID=${CONTAINER_ID}" >> "$GITHUB_ENV" |
| 221 | +
|
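The IRSA forwarding in the step above can be checked offline. The following is a minimal sketch, not the workflow's own code: the function name `irsa_docker_args` and its constants are hypothetical, and it only assembles the same `-e` / `-v` arguments that the `docker run` call passes.

```python
from typing import List, Mapping

# Default token path used by the workflow when AWS_WEB_IDENTITY_TOKEN_FILE is unset.
DEFAULT_HOST_TOKEN = "/var/run/secrets/eks.amazonaws.com/serviceaccount/token"
# Path the token is mounted at inside the container (taken from the step above).
CONTAINER_TOKEN = "/var/run/secrets/eks-token"


def irsa_docker_args(env: Mapping[str, str], region: str) -> List[str]:
    """Build the docker-run flags that forward IRSA credentials (hypothetical helper)."""
    role_arn = env.get("AWS_ROLE_ARN")
    if not role_arn:
        # Mirrors `set -u` in the shell step: fail early instead of starting
        # a container that cannot assume the role.
        raise KeyError("AWS_ROLE_ARN is not set on the runner")
    # Same semantics as ${AWS_WEB_IDENTITY_TOKEN_FILE:-default}: unset or empty
    # falls back to the default path.
    host_token = env.get("AWS_WEB_IDENTITY_TOKEN_FILE") or DEFAULT_HOST_TOKEN
    return [
        "-e", f"AWS_ROLE_ARN={role_arn}",
        "-e", f"AWS_WEB_IDENTITY_TOKEN_FILE={CONTAINER_TOKEN}",
        "-v", f"{host_token}:{CONTAINER_TOKEN}:ro",
        "-e", f"AWS_DEFAULT_REGION={region}",
    ]
```

The container always sees the token at the fixed mount path, so only the host side of the `-v` mapping depends on the runner's environment.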
| 222 | + - name: Wait for model streamer to complete |
| 223 | + env: |
| 224 | + LOAD_TIMEOUT: ${{ env.MX_CI_LOAD_TIMEOUT_SECONDS }} |
| 225 | + run: | |
| 226 | + set -euo pipefail |
| 227 | + deadline=$((SECONDS + LOAD_TIMEOUT)) |
| 228 | + while [ $SECONDS -lt $deadline ]; do |
| 229 | + if docker logs "${CONTAINER_ID}" 2>&1 | grep "Model streamer weight loading complete" > /dev/null; then
| 230 | + echo "Model streamer loading confirmed." |
| 231 | + exit 0 |
| 232 | + fi |
| 233 | + if [ "$(docker inspect "${CONTAINER_ID}" --format '{{.State.Running}}')" != "true" ]; then |
| 234 | + echo "ERROR: container exited before model streamer completed." |
| 235 | + docker logs "${CONTAINER_ID}" 2>&1 | tail -80 |
| 236 | + exit 1 |
| 237 | + fi |
| 238 | + echo "Still loading... (${SECONDS}s elapsed)" |
| 239 | + sleep 10 |
| 240 | + done |
| 241 | +
|
| 242 | + echo "ERROR: model streamer did not complete within ${LOAD_TIMEOUT}s." |
| 243 | + docker logs "${CONTAINER_ID}" 2>&1 | tail -80 |
| 244 | + exit 1 |
| 245 | +
|
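The wait step above is a deadline-bounded poll with an early exit when the container dies. A minimal sketch of the same control flow follows, with the log reader and liveness check injected as callables so it can run without Docker. The name `wait_for_marker` and its parameters are hypothetical.

```python
import time
from typing import Callable

# Log line the workflow waits for.
MARKER = "Model streamer weight loading complete"


def wait_for_marker(
    read_logs: Callable[[], str],
    is_running: Callable[[], bool],
    timeout_s: float,
    marker: str = MARKER,
    poll_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the logs until the marker appears; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # The whole log is read first and then searched, so the reader is
        # never cut off partway through its output.
        if marker in read_logs():
            return True
        if not is_running():
            raise RuntimeError("container exited before the marker appeared")
        sleep(poll_s)
    return False
```

In a real run, `read_logs` could wrap `subprocess.run(["docker", "logs", container_id], capture_output=True, text=True)` and join stdout with stderr, and `is_running` could wrap the same `docker inspect` call the step uses. Both wrappers are assumptions and not part of the workflow.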
| 246 | + - name: Wait for OpenAI server to be ready |
| 247 | + env: |
| 248 | + PORT: ${{ env.MX_CI_VLLM_PORT }} |
| 249 | + run: | |
| 250 | + set -euo pipefail |
| 251 | + timeout 60 bash -c \ |
| 252 | + "until curl -sf http://localhost:${PORT}/health > /dev/null; do sleep 2; done" |
| 253 | + echo "Server ready on port ${PORT}." |
| 254 | +
|
| 255 | + - name: Verify inference |
| 256 | + env: |
| 257 | + MODEL: ${{ env.MX_CI_MODEL }} |
| 258 | + PORT: ${{ env.MX_CI_VLLM_PORT }} |
| 259 | + run: | |
| 260 | + set -euo pipefail |
| 261 | + RESPONSE=$(curl -sS --max-time 60 "http://localhost:${PORT}/v1/completions" \ |
| 262 | + -H "Content-Type: application/json" \ |
| 263 | + -d "{\"model\": \"${MODEL}\", \"prompt\": \"The capital of France is\", \"max_tokens\": 8}") |
| 264 | + echo "Response: ${RESPONSE}" |
| 265 | + echo "${RESPONSE}" | python3 -c " |
| 266 | + import json, sys |
| 267 | + body = json.load(sys.stdin) |
| 268 | + choices = body.get('choices', []) |
| 269 | + assert choices and choices[0].get('text'), f'No completion text in response: {body}' |
| 270 | + print('Inference OK:', repr(choices[0]['text'][:60])) |
| 271 | + " |
| 272 | +
|
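The check in the Verify inference step can also be written as a small function and exercised against canned responses. This is a sketch under the assumption that the response body has the `choices[0].text` shape the step already relies on. The name `extract_completion_text` is hypothetical.

```python
import json


def extract_completion_text(raw: str) -> str:
    """Return choices[0].text from a /v1/completions body, or raise ValueError."""
    body = json.loads(raw)
    choices = body.get("choices", [])
    # An error payload without `choices`, an empty list, or an empty text
    # field all count as a failed inference.
    if not choices or not choices[0].get("text"):
        raise ValueError(f"No completion text in response: {body}")
    return choices[0]["text"]
```

Raising `ValueError` instead of using `assert` keeps the check active even if Python runs with `-O`, which strips assertions.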
| 273 | + - name: Cleanup container |
| 274 | + if: always() |
| 275 | + run: | |
| 276 | + if [ -n "${CONTAINER_ID:-}" ]; then |
| 277 | + echo "::group::Container logs (tail 200)" |
| 278 | + docker logs "${CONTAINER_ID}" 2>&1 | tail -200 || true |
| 279 | + echo "::endgroup::" |
| 280 | + docker stop "${CONTAINER_ID}" 2>/dev/null || true |
| 281 | + docker rm -f "${CONTAINER_ID}" 2>/dev/null || true |
| 282 | + fi |
| 283 | +
|
| 284 | + - name: Cleanup local image |
| 285 | + # `if: always()` so this also runs when the job is cancelled by |
| 286 | + # cancel-in-progress concurrency. Without it, the locally built |
| 287 | + # vLLM worker image (~15GB) accumulates in the GPU runner's Docker |
| 288 | + # daemon across runs and eventually exhausts disk. |
| 289 | + if: always() |
| 290 | + run: | |
| 291 | + docker image rm -f "${WORKER_IMAGE}" 2>/dev/null || true |