Commit 284b8e5

ci(model-streamer): restore stage 2 (full end-to-end with Docker)
Signed-off-by: Tanushriya Singh <tanushriyas@nvidia.com>
1 parent 657ee5e commit 284b8e5

1 file changed

Lines changed: 166 additions & 39 deletions

.github/workflows/modelexpress-model-streamer-test.yml

@@ -3,55 +3,48 @@

 # Standalone model-streamer CI lane.
 #
-# Status — STAGE 1 (S3-upload only): the Docker daemon is not yet
-# provisioned on the GPU runner (prod-modelexpress-tester-amd-gpu-v1), so
-# `docker build` / `docker run` would fail. While ops works on enabling
-# DinD / socket-mount + GPU passthrough, this workflow runs only the parts
-# that don't need Docker:
+# Status — STAGE 2 (full end-to-end): Docker is now provisioned on the GPU
+# runner (DinD) and NGC_API_KEY is available, so we can build the worker
+# image locally on the runner and exercise the full streamer-load + inference
+# flow without round-tripping through NGC. End-to-end this workflow does:
 #   1. Checkout
-#   2. Install boto3 + huggingface-cli
-#   3. Download safetensors from HuggingFace (if not already cached in S3)
-#      and upload them to s3://${MX_CI_S3_BUCKET}/models/${MX_CI_MODEL}/
+#   2. Stage safetensors in S3 (idempotent — skipped if already cached)
+#   3. Build the vLLM worker image locally on the runner (no push)
+#   4. Run a single vLLM container with --load-format mx and
+#      RUNAI_STREAMER_CONCURRENCY set; weights stream from S3 via IRSA
+#   5. Wait for "Model streamer weight loading complete" in the container
+#      logs
+#   6. Wait for /health on the OpenAI server
+#   7. Send a /v1/completions request; assert non-empty completion text
+#   8. Always-runs cleanup: stop+rm the container, remove the local image
+#      (cancel-in-progress concurrency means GHA could SIGTERM us mid-run;
+#      cleanup steps are marked `if: always()` so the runner's Docker daemon
+#      doesn't accumulate stale containers/images across runs)
 #
-# What this validates today:
-#   - Runner picks up GHA jobs end-to-end
-#   - Self-hosted runner has internet egress for `pip install` + HF download
-#   - IRSA on the runner has the right S3 permissions (list / put / head)
+# The steps are inlined from .github/actions/run-mx-streamer-test/action.yml
+# so we can iterate on them in isolation. Once stable here, port any
+# improvements back into that composite action.
 #
-# What it does NOT validate yet (deferred to STAGE 2 once Docker is up):
-#   - vLLM image build
-#   - runai-model-streamer reading from S3 inside the container
-#   - --load-format mx + ModelExpress plugin loading
-#   - OpenAI inference endpoint comes up
-#
-# When ops enables Docker on the runner, re-add the steps from the full
-# composite action at .github/actions/run-mx-streamer-test/action.yml
-# (build, docker run, wait, verify inference, cleanup) — they're already
-# written and tested in the main workflow's `model-streamer-vllm` job.
+# Local-build / no-NGC-push pattern: doing both build and run on the GPU
+# runner means the image lives in the runner's Docker daemon only — never
+# pushed to or pulled from any registry. Saves bandwidth + avoids exercising
+# the NGC pull path here (the main workflow's `model-streamer-vllm` job does
+# the registry round-trip; this one focuses on the runner-local mechanics).
 #
 # Required secrets:
-#   HF_TOKEN — HuggingFace token; ignored when empty (Qwen2.5-0.5B is
-#              public). Required only if the model is gated.
+#   NGC_API_KEY — used to `docker login nvcr.io` so the build can pull the
+#                 base image `vllm/vllm-openai:v0.17.1` (public Docker Hub)
+#                 + any nvcr.io transitive layers. Reuses the same secret
+#                 the main workflow uses.
+#   HF_TOKEN    — HuggingFace token; ignored when empty (Qwen2.5-0.5B is
+#                 public). Required only if the model is gated.
 #
 # Required IRSA on the GPU runner:
 #   IAM role with list / put / head on s3://${MX_CI_S3_BUCKET}/models/.

 name: ModelExpress Model Streamer Test

 on:
-  # Triggered indirectly via copy-pr-bot. PR-driven flow:
-  #   1. Open a PR (the PR itself does NOT fire this workflow — `pull_request`
-  #      events are forbidden on Velonix self-hosted runners because PR-head
-  #      code is untrusted).
-  #   2. Either:
-  #      a. All commits on the PR are GPG-signed → copy-pr-bot trusts the
-  #         author and auto-creates `pull-request/<N>` with the PR content.
-  #      b. Commits are unsigned or come from an external contributor →
-  #         a maintainer comments `/ok to test <commit_sha>` on the PR.
-  #   3. The bot pushes the PR content to `pull-request/<N>` in this repo.
-  #      That push (from the trusted bot identity, into an internal branch)
-  #      fires this workflow.
-  # See the dynamo reference: ai-dynamo/dynamo/.github/workflows/pr.yaml.
   push:
     branches:
       - "pull-request/[0-9]+"
@@ -67,14 +60,24 @@ env:
   MX_CI_MODEL: Qwen/Qwen2.5-0.5B
   MX_CI_S3_BUCKET: ai-dynamo-modelexpress-ci
   MX_CI_S3_REGION: us-east-1
+  # Docker / runtime knobs (previously composite-action inputs). Tweak here
+  # without touching the composite action.
+  MX_CI_VLLM_PORT: "18888"
+  MX_CI_STREAMER_CONCURRENCY: "16"
+  MX_CI_LOAD_TIMEOUT_SECONDS: "300"

 jobs:
-  s3-upload:
-    name: S3 upload (Stage 1, no Docker)
+  model-streamer:
+    name: Model Streamer test (vLLM, S3)
     runs-on: prod-modelexpress-tester-amd-gpu-v1
     permissions:
       contents: read

+    env:
+      # Local image tag, lives only in the GPU runner's Docker daemon —
+      # never pushed to or pulled from any registry.
+      WORKER_IMAGE: mx-worker-vllm:local
+
     steps:
       - name: Checkout
         uses: actions/checkout@v4
@@ -162,3 +165,127 @@ jobs:
           )
           print(f"Upload verified: {len(uploaded)} file(s).")
           EOF
+
+      - name: Log in to NGC for base image pulls
+        env:
+          NGC_API_KEY: ${{ secrets.NGC_API_KEY }}
+        run: |
+          echo "${NGC_API_KEY}" | docker login nvcr.io -u '$oauthtoken' --password-stdin
+
+      - name: Build vLLM worker image (local, no registry)
+        # Build directly on the GPU runner — image lives in the local Docker
+        # daemon and is referenced by `docker run` below via the WORKER_IMAGE
+        # env. Build context is repo root (Dockerfile COPYs from
+        # modelexpress_client/python/).
+        run: |
+          docker build \
+            -f ci/k8s/client/vllm/Dockerfile \
+            -t "${WORKER_IMAGE}" \
+            .
+
+      - name: Start vLLM container with model streamer
+        env:
+          # WORKER_IMAGE from job-level env (local tag built above).
+          MODEL: ${{ env.MX_CI_MODEL }}
+          S3_BUCKET: ${{ env.MX_CI_S3_BUCKET }}
+          S3_REGION: ${{ env.MX_CI_S3_REGION }}
+          PORT: ${{ env.MX_CI_VLLM_PORT }}
+          STREAMER_CONCURRENCY: ${{ env.MX_CI_STREAMER_CONCURRENCY }}
+        run: |
+          set -euo pipefail
+          MODEL_S3_URI="s3://${S3_BUCKET}/models/${MODEL}"
+
+          # IRSA credentials don't auto-propagate into child containers. The EKS
+          # runner pod gets AWS_ROLE_ARN + AWS_WEB_IDENTITY_TOKEN_FILE wired in,
+          # but a `docker run` child sees neither unless we forward them: pass
+          # the role ARN as -e and mount the OIDC token file in as a volume.
+          # boto3 inside the container then does its own sts:AssumeRoleWithWebIdentity.
+          IRSA_TOKEN="${AWS_WEB_IDENTITY_TOKEN_FILE:-/var/run/secrets/eks.amazonaws.com/serviceaccount/token}"
+
+          CONTAINER_ID=$(docker run -d --gpus all --ipc=host \
+            -e AWS_ROLE_ARN="${AWS_ROLE_ARN}" \
+            -e AWS_WEB_IDENTITY_TOKEN_FILE="/var/run/secrets/eks-token" \
+            -v "${IRSA_TOKEN}:/var/run/secrets/eks-token:ro" \
+            -e AWS_DEFAULT_REGION="${S3_REGION}" \
+            -e MX_MODEL_URI="${MODEL_S3_URI}" \
+            -e RUNAI_STREAMER_CONCURRENCY="${STREAMER_CONCURRENCY}" \
+            -e VLLM_PLUGINS=modelexpress \
+            -p "${PORT}:${PORT}" \
+            "${WORKER_IMAGE}" \
+            python3 -m vllm.entrypoints.openai.api_server \
+              --model "${MODEL}" \
+              --load-format mx \
+              --port "${PORT}")
+          echo "Container: ${CONTAINER_ID}"
+          echo "CONTAINER_ID=${CONTAINER_ID}" >> "$GITHUB_ENV"
+
+      - name: Wait for model streamer to complete
+        env:
+          LOAD_TIMEOUT: ${{ env.MX_CI_LOAD_TIMEOUT_SECONDS }}
+        run: |
+          set -euo pipefail
+          deadline=$((SECONDS + LOAD_TIMEOUT))
+          while [ $SECONDS -lt $deadline ]; do
+            if docker logs "${CONTAINER_ID}" 2>&1 | grep -q "Model streamer weight loading complete"; then
+              echo "Model streamer loading confirmed."
+              exit 0
+            fi
+            if [ "$(docker inspect "${CONTAINER_ID}" --format '{{.State.Running}}')" != "true" ]; then
+              echo "ERROR: container exited before model streamer completed."
+              docker logs "${CONTAINER_ID}" 2>&1 | tail -80
+              exit 1
+            fi
+            echo "Still loading... (${SECONDS}s elapsed)"
+            sleep 10
+          done
+
+          echo "ERROR: model streamer did not complete within ${LOAD_TIMEOUT}s."
+          docker logs "${CONTAINER_ID}" 2>&1 | tail -80
+          exit 1
+
+      - name: Wait for OpenAI server to be ready
+        env:
+          PORT: ${{ env.MX_CI_VLLM_PORT }}
+        run: |
+          set -euo pipefail
+          timeout 60 bash -c \
+            "until curl -sf http://localhost:${PORT}/health > /dev/null; do sleep 2; done"
+          echo "Server ready on port ${PORT}."
+
+      - name: Verify inference
+        env:
+          MODEL: ${{ env.MX_CI_MODEL }}
+          PORT: ${{ env.MX_CI_VLLM_PORT }}
+        run: |
+          set -euo pipefail
+          RESPONSE=$(curl -sS --max-time 60 "http://localhost:${PORT}/v1/completions" \
+            -H "Content-Type: application/json" \
+            -d "{\"model\": \"${MODEL}\", \"prompt\": \"The capital of France is\", \"max_tokens\": 8}")
+          echo "Response: ${RESPONSE}"
+          echo "${RESPONSE}" | python3 -c "
+          import json, sys
+          body = json.load(sys.stdin)
+          choices = body.get('choices', [])
+          assert choices and choices[0].get('text'), f'No completion text in response: {body}'
+          print('Inference OK:', repr(choices[0]['text'][:60]))
+          "
+
+      - name: Cleanup container
+        if: always()
+        run: |
+          if [ -n "${CONTAINER_ID:-}" ]; then
+            echo "::group::Container logs (tail 200)"
+            docker logs "${CONTAINER_ID}" 2>&1 | tail -200 || true
+            echo "::endgroup::"
+            docker stop "${CONTAINER_ID}" 2>/dev/null || true
+            docker rm -f "${CONTAINER_ID}" 2>/dev/null || true
+          fi
+
+      - name: Cleanup local image
+        # `if: always()` so this also runs when the job is cancelled by
+        # cancel-in-progress concurrency. Without it, the locally built
+        # vLLM worker image (~15GB) accumulates in the GPU runner's Docker
+        # daemon across runs and eventually exhausts disk.
+        if: always()
+        run: |
+          docker image rm -f "${WORKER_IMAGE}" 2>/dev/null || true
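
The "Wait for model streamer to complete" step is a poll loop with three outcomes: the marker line shows up in the logs, the container stops first, or the deadline passes. Below is a minimal Python sketch of that control flow, not code from this commit. The function name `wait_for_marker` and the injected `read_logs` / `is_running` callables are hypothetical stand-ins for `docker logs` and `docker inspect`; only the marker string and the 10-second poll interval come from the workflow.

```python
import time
from typing import Callable

# Log line the workflow greps for in the container output.
MARKER = "Model streamer weight loading complete"


def wait_for_marker(
    read_logs: Callable[[], str],      # hypothetical stand-in for `docker logs`
    is_running: Callable[[], bool],    # hypothetical stand-in for `docker inspect`
    timeout_s: float,
    poll_s: float = 10.0,
    marker: str = MARKER,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the marker appears, the container stops, or time runs out.

    Returns "loaded", "exited", or "timeout".
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        # Check the logs first so a container that logged the marker and
        # then stopped is still reported as loaded.
        if marker in read_logs():
            return "loaded"
        # Fail fast instead of waiting out the timeout on a dead container.
        if not is_running():
            return "exited"
        sleep(poll_s)
    return "timeout"
```

Checking the logs before the running state mirrors the order in the shell loop. Injecting `clock` and `sleep` is a choice made for this sketch so the timeout branch can be exercised without waiting in real time.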

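The "Verify inference" step calls `curl -sS` without `--fail`, so an HTTP error response still exits 0 and the inline Python assertion is the only gate. The sketch below restates that check as a standalone function so the accepted and rejected payloads are easy to see. The name `completion_text` and the sample payloads are illustrative assumptions, not part of the workflow.

```python
import json


def completion_text(raw: str) -> str:
    """Return the first completion's text, or raise if it is missing or empty."""
    body = json.loads(raw)
    # Same shape the workflow inspects: body["choices"][0]["text"].
    choices = body.get("choices") or []
    text = choices[0].get("text") if choices else None
    if not text:
        raise ValueError(f"No completion text in response: {body}")
    return text


if __name__ == "__main__":
    # Illustrative payload only; real responses carry more fields.
    print(repr(completion_text('{"choices": [{"text": " Paris"}]}')))
```

A JSON error body with no `choices` key is rejected the same way as an empty completion, which is also what the inline assertion does.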