[DYNAMO] smoke runner follow-up from tested branch #2445
Draft
AmeenP wants to merge 2 commits into feat/dynamo-deployment-example
Combines the existing run_dynamo.sh (GPU 0) and run_smoke_test.sh (GPU 1) into a single one-shot launcher. Recovered from the bis/dynamo-integration branch as a convenience wrapper for the local smoke flow documented in current-plan.md §8.1.

Signed-off-by: Biswa Panda <biswa.panda@gmail.com>
This machine has 1× NVIDIA RTX PRO 6000 Blackwell (96 GB), not 2 GPUs. Both Dynamo inference and the prime-rl trainer must share GPU 0.

Changes:
* run_full_smoke.sh: trainer CUDA_VISIBLE_DEVICES 1 -> 0; add nvidia-smi and Dynamo /health preflight checks; tighten with set -euo pipefail.
* run_dynamo.sh: pass --gpu-memory-utilization 0.45 to the vLLM worker by default so the trainer has ~50 GB to load FSDP-sharded weights + optimizer state. Override with GPU_MEM_UTIL env var if needed.

Signed-off-by: Biswa Panda <biswa.panda@gmail.com>
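For reviewers who want to sanity-check the single-GPU split, here is a minimal sketch of the pattern the commit message describes. It is not the contents of run_dynamo.sh or run_full_smoke.sh. The DYNAMO_URL variable, its localhost:8000 default, and the helper names trainer_budget_gb and preflight are assumptions made for illustration; only the GPU_MEM_UTIL name, the 0.45 default, the 96 GB figure, and the /health path come from the commit message.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Default taken from the commit message; callers can override it,
# e.g. GPU_MEM_UTIL=0.40 ./run_dynamo.sh
GPU_MEM_UTIL="${GPU_MEM_UTIL:-0.45}"

# ASSUMPTION: hypothetical variable name and address for the Dynamo frontend.
DYNAMO_URL="${DYNAMO_URL:-http://localhost:8000}"

# Print the memory (GB) left over once vLLM reserves its fraction.
# Usage: trainer_budget_gb <total_gb> <vllm_fraction>
trainer_budget_gb() {
  awk -v total="$1" -v util="$2" 'BEGIN { printf "%.1f", total * (1 - util) }'
}

# Fail fast if the GPU or the Dynamo endpoint is not usable.
# Defined only; a launcher would call it before starting the trainer.
preflight() {
  command -v nvidia-smi >/dev/null 2>&1 || {
    echo "nvidia-smi not found" >&2
    return 1
  }
  nvidia-smi --query-gpu=name,memory.total --format=csv,noheader || return 1
  curl -fsS "${DYNAMO_URL}/health" >/dev/null || {
    echo "Dynamo /health check failed at ${DYNAMO_URL}" >&2
    return 1
  }
}
```

With the default, `trainer_budget_gb 96 "$GPU_MEM_UTIL"` prints 52.8, which lines up with the ~50 GB quoted above once CUDA context and allocator overhead are taken off the top.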
Summary
Ports the missing smoke-test runner fixes from Biswa's tested biswapanda/prime-rl@bis/prime-rl-merged branch on top of #2394 (feat/dynamo-deployment-example).

* tools/dynamo/run_full_smoke.sh for the orchestrator + trainer smoke flow
* run_dynamo.sh vLLM worker memory via GPU_MEM_UTIL defaulting to 0.45

Context
This keeps the tested local Dynamo smoke fixes reviewable separately from the original deployment example PR. The branch has been rebased onto latest prime-rl/main via its parent #2394; the admin-stub formatting fix now lives in #2394 itself.

Validation
* uvx ruff==0.13.0 check tools/dynamo/admin_stub.py
* uvx ruff==0.13.0 format --check tools/dynamo/admin_stub.py
* python -m py_compile tools/dynamo/admin_stub.py
* bash -n tools/dynamo/run_dynamo.sh tools/dynamo/run_smoke_test.sh tools/dynamo/run_full_smoke.sh

Full pytest is left to Linux CI for this stack because the local checkout is macOS while the lockfile targets Linux environments.