fix(e2e): harden Brev E2E infrastructure for CPU-only instances#1266
fix(e2e): harden Brev E2E infrastructure for CPU-only instances#1266jyaunches wants to merge 2 commits intoNVIDIA:mainfrom
Conversation
Shared E2E infrastructure improvements that make Brev-based CI reliable on CPU-only instances: brev-setup.sh: - Extract HAS_GPU flag for consistent GPU detection — nvidia-smi must run successfully, not just exist on PATH (Brev GPU images ship the binary even on CPU instances) - All GPU gates (container toolkit, Docker runtime reset, vLLM) use the single HAS_GPU flag - Replace cloud-init || true with proper error check + warning - Add timeout (300s) and quiet window (5s) to apt wait loop to prevent indefinite hangs and races - Add retry loops (5 attempts, 30s backoff) for Node.js and Docker apt-get installs - Unsuppress NodeSource installer output for debugging - Reset Docker default runtime to runc on CPU-only instances where Brev pre-configures nvidia as default setup.sh: - Check nvidia-smi exit code instead of just PATH presence for GPU gateway flag brev-e2e.test.js: - Use known GCP instance type (n2-standard-4) instead of cpu search - Add shellQuote() for safe secret interpolation in SSH commands - Delete leftover instances before creating new ones - Wait for cloud-init before running bootstrap
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThe changes centralize GPU detection via a runtime probe, add an exclusive apt-access sequence to avoid package manager conflicts, add retry/backoff and logging for Node.js and Docker installs, adjust Docker daemon for CPU-only systems, and change E2E test instance provisioning to use a direct instance type with improved env quoting and cloud-init waits. Changes
Sequence DiagramsequenceDiagram
participant Test as E2E Test
participant Script as brev-setup.sh
participant CloudInit as cloud-init
participant Apt as apt/dpkg System
participant Docker as Docker Daemon
participant NVIDIA as nvidia-smi
Test->>Script: trigger setup
Script->>NVIDIA: run nvidia-smi (runtime probe)
NVIDIA-->>Script: success / failure
alt GPU available
Script->>Script: set HAS_GPU=true
Script->>Docker: ensure nvidia default-runtime present / install NVIDIA Toolkit
Script->>Script: install/start vLLM (GPU path)
else CPU-only
Script->>Docker: check /etc/docker/daemon.json
Docker-->>Script: returns default-runtime
Script->>Docker: remove nvidia default-runtime if present
Script->>Docker: restart
end
Script->>CloudInit: wait (if cloud-init present)
CloudInit-->>Script: ready
Script->>Apt: stop apt timers & unattended-upgrades
Script->>Apt: kill lingering apt/dpkg processes
Script->>Apt: clear lock files & run dpkg --configure -a
loop Poll for idle window
Script->>Apt: lsof checks
Apt-->>Script: active / idle
end
Script->>Script: perform installs (Node/Docker) with retries
Script-->>Test: setup complete
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/brev-e2e.test.js`:
- Around line 137-143: The execSync invocation interpolates BREv_INSTANCE_TYPE
and INSTANCE_NAME into a shell string which risks shell injection or breaking on
spaces; replace the execSync call in the test (the line using execSync in
test/e2e/brev-e2e.test.js) with a safer call using child_process.execFileSync
(or spawnSync) passing the command ("brev") and args as an array [ "create",
"--type", BREV_INSTANCE_TYPE, INSTANCE_NAME, "--detached" ] and keep the same
options object (encoding, timeout, stdio), or alternatively ensure the
interpolated variables are properly quoted/escaped before passing to the shell;
target the execSync invocation that logs with elapsed() and change it
accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e23869cf-0bc6-41a7-86b4-3f2f112a1dc1
📒 Files selected for processing (3)
scripts/brev-setup.shscripts/setup.shtest/e2e/brev-e2e.test.js
Replaces execSync with execFileSync for the brev create call, passing args as an array instead of interpolating into a shell string. Avoids shell injection risk if BREV_INSTANCE_TYPE or INSTANCE_NAME ever contain spaces or metacharacters. Addresses CodeRabbit review feedback on PR NVIDIA#1266.
|
✨ Thanks for submitting this pull request, which proposes a way to harden Brev E2E infrastructure for CPU-only instances. Possibly related open issues:
|
…ty (#1470) ## Summary - unify installer and onboarding host detection around shared TypeScript preflight logic - move `deploy` behavior into TypeScript, thin the Brev compatibility wrapper, and harden Brev readiness handling - demote or remove legacy platform-specific setup paths (`setup-spark`, `brev-setup.sh`) in favor of the canonical installer + onboard flow - update docs, CLI help, and Brev E2E coverage to match the new behavior ## What Changed - added shared host assessment and remediation planning in `src/lib/preflight.ts` - wired installer and onboard flows to the same host preflight decisions - changed Podman handling from hard block to unsupported-runtime warning - migrated deploy logic into `src/lib/deploy.ts` - updated `nemoclaw deploy` to use the authenticated Brev CLI, current Brev create flags, explicit GCP provider default, stricter readiness checks, and standard installer/onboard flow - removed `scripts/setup-spark.sh` and reduced `scripts/brev-setup.sh` to a deprecated compatibility wrapper - updated README/docs/help text and hardened the Brev E2E cleanup path ## Validation - `npm run build:cli` - targeted Vitest coverage for `src/lib/preflight.test.ts`, `src/lib/deploy.test.ts`, `test/install-preflight.test.js`, `test/cli.test.js`, `test/runner.test.js` - live Brev validation with `TEST_SUITE=deploy-cli` on `cpu-e2.4vcpu-16gb` - confirmed successful end-to-end remote deploy after waiting for Brev `status=RUNNING`, `build_status=COMPLETED`, `shell_status=READY` ## Related Issues - Fixes #1377 - Addresses #1330 - Addresses #1390 - Related to #1404 ## Credit / Prior Work This branch builds on ideas and prior work from: - #1368 by @zyang-dev for simplifying Spark setup and removing the old cgroup workaround - #1395 and #1468 by @kjw3 for the thin installer/bootstrap direction and installer path reliability - #1450 by @cjagwani for switching Brev flows toward GCP for reliability - #1383 by @13ernkastel for the current Brev create flag compatibility work - #1364 by @WuKongAI-CMU for deploy sync-path fixes - #1362 and #1266 by @jyaunches for the Brev E2E/launchable infrastructure direction - issue ideas from #1377 and #1404 by @zNeill, #1330 by @Marcelo5444, and #1390 by @ericksoa <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Improved host diagnostics with actionable remediation guidance surfaced during installer/onboard preflight. * **Improvements** * macOS (Intel) now recommends Docker Desktop; DGX Spark guidance now uses the standard installer + `nemoclaw onboard`. * Preflight output shows detected runtime and WSL notes; installer prints remediation actions and will skip onboarding on blocking issues. * **Deprecations** * `nemoclaw deploy`, `nemoclaw setup-spark`, and the legacy bootstrap wrapper are now deprecated compatibility paths. * **Documentation** * Quickstart, troubleshooting, and command reference updated to reflect installer+onboard flow and deprecation guidance. * **Tests** * Added/updated tests covering preflight, deploy compatibility, CLI aliases, and deploy e2e scenarios. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
Shared E2E infrastructure improvements that make Brev-based CI reliable on CPU-only instances. Split from #1219 to keep concerns separate.
Changes
scripts/brev-setup.sh
HAS_GPUflag for consistent GPU detection —nvidia-smimust run successfully, not just exist on PATH (Brev GPU images ship the binary even on CPU instances)HAS_GPUflagcloud-init || truewith proper error check + warning (no longer silently swallows failures)scripts/setup.sh
nvidia-smiexit code instead of just PATH presence for GPU gateway flagtest/e2e/brev-e2e.test.js
n2-standard-4) instead of cpu searchshellQuote()for safe secret interpolation in SSH commandsTesting
bash -n scripts/brev-setup.sh— syntax check passesbash -n scripts/setup.sh— syntax check passesnpm test— CLI tests passType of Change
Summary by CodeRabbit
Bug Fixes
Chores
Tests