fix(e2e): harden Brev E2E infrastructure for CPU-only instances by jyaunches · Pull Request #1266 · NVIDIA/NemoClaw

jyaunches · 2026-04-01T19:35:58Z

Summary

Shared E2E infrastructure improvements that make Brev-based CI reliable on CPU-only instances. Split from #1219 to keep concerns separate.

Changes

scripts/brev-setup.sh

Extract HAS_GPU flag for consistent GPU detection — nvidia-smi must run successfully, not just exist on PATH (Brev GPU images ship the binary even on CPU instances)
All GPU gates (container toolkit, Docker runtime reset, vLLM) use the single HAS_GPU flag
Replace cloud-init || true with proper error check + warning (no longer silently swallows failures)
Add timeout (300s) and quiet window (5s) to apt wait loop to prevent indefinite hangs and races where apt briefly pauses between operations
Add retry loops (5 attempts, 30s backoff) for Node.js and Docker apt-get installs
Unsuppress NodeSource installer output for debugging
Reset Docker default runtime to runc on CPU-only instances

scripts/setup.sh

Check nvidia-smi exit code instead of just PATH presence for GPU gateway flag

test/e2e/brev-e2e.test.js

Use known GCP instance type (n2-standard-4) instead of cpu search
Add shellQuote() for safe secret interpolation in SSH commands
Delete leftover instances before creating new ones
Wait for cloud-init before running bootstrap

Testing

bash -n scripts/brev-setup.sh — syntax check passes
bash -n scripts/setup.sh — syntax check passes
npm test — CLI tests pass
shellcheck passes (via pre-commit)

Type of Change

Code change for a new feature, bug fix, or refactor.
Code change with doc updates.
Doc only.

Summary by CodeRabbit

Bug Fixes
- More reliable GPU detection by probing runtime availability instead of binary presence
- Improved apt access handling with automatic lock clearing, background timer suppression, and wait/retry for idle state
- Automatic Docker runtime fallback to CPU when no GPU is available
Chores
- Centralized GPU flag to unify GPU-gated behavior
- Retry logic with backoff for key installers and more visible installer output
Tests
- More robust e2e provisioning and VM readiness checks, plus safer remote quoting

Shared E2E infrastructure improvements that make Brev-based CI reliable on CPU-only instances: brev-setup.sh: - Extract HAS_GPU flag for consistent GPU detection — nvidia-smi must run successfully, not just exist on PATH (Brev GPU images ship the binary even on CPU instances) - All GPU gates (container toolkit, Docker runtime reset, vLLM) use the single HAS_GPU flag - Replace cloud-init || true with proper error check + warning - Add timeout (300s) and quiet window (5s) to apt wait loop to prevent indefinite hangs and races - Add retry loops (5 attempts, 30s backoff) for Node.js and Docker apt-get installs - Unsuppress NodeSource installer output for debugging - Reset Docker default runtime to runc on CPU-only instances where Brev pre-configures nvidia as default setup.sh: - Check nvidia-smi exit code instead of just PATH presence for GPU gateway flag brev-e2e.test.js: - Use known GCP instance type (n2-standard-4) instead of cpu search - Add shellQuote() for safe secret interpolation in SSH commands - Delete leftover instances before creating new ones - Wait for cloud-init before running bootstrap

coderabbitai · 2026-04-01T19:36:18Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1691e813-e57c-4538-ba7a-1b034ce6672f

📥 Commits

Reviewing files that changed from the base of the PR and between d92dc54 and dafcaba.

📒 Files selected for processing (1)

test/e2e/brev-e2e.test.js

🚧 Files skipped from review as they are similar to previous changes (1)

test/e2e/brev-e2e.test.js

📝 Walkthrough

Walkthrough

The changes centralize GPU detection via a runtime probe, add an exclusive apt-access sequence to avoid package manager conflicts, add retry/backoff and logging for Node.js and Docker installs, adjust Docker daemon for CPU-only systems, and change E2E test instance provisioning to use a direct instance type with improved env quoting and cloud-init waits.

Changes

Cohort / File(s)	Summary
GPU Detection & Runtime Configuration `scripts/brev-setup.sh`, `scripts/setup.sh`	Centralized GPU detection using a runtime `nvidia-smi` probe. Gate NVIDIA Container Toolkit and vLLM install/start with `HAS_GPU`. `setup.sh` now appends `--gpu` to gateway args only when `nvidia-smi` runs successfully.
Apt Package Manager Lock Management `scripts/brev-setup.sh`	Introduced exclusive apt access: wait for `cloud-init` (if present), stop/disable apt timers and unattended-upgrades, kill lingering apt/dpkg processes, clear lock files, run `dpkg --configure -a`, then poll until apt/dpkg are idle with a max timeout.
Installation Robustness & Logging `scripts/brev-setup.sh`	NodeSource (Node.js) installer and Docker install now retry up to 5 times with 30s backoff; NodeSource installer output is no longer silenced for improved logging.
CPU-only Docker Runtime Handling `scripts/brev-setup.sh`	When no GPU detected, check `/etc/docker/daemon.json` for `"default-runtime":"nvidia"`, remove that setting if present, and restart Docker so containers default to `runc`.
E2E Test Provisioning & Env Handling `test/e2e/brev-e2e.test.js`	Replaced `BREV_MIN_VCPU`/`BREV_MIN_RAM` with `BREV_INSTANCE_TYPE` (default `n2-standard-4`) and create instances with `--type`. Replaced `shellEscape()` with `shellQuote()` for env quoting. Added best-effort pre-test instance cleanup and `cloud-init status --wait` after sync before running `brev-setup.sh`.

Sequence Diagram

sequenceDiagram
    participant Test as E2E Test
    participant Script as brev-setup.sh
    participant CloudInit as cloud-init
    participant Apt as apt/dpkg System
    participant Docker as Docker Daemon
    participant NVIDIA as nvidia-smi

    Test->>Script: trigger setup
    Script->>NVIDIA: run nvidia-smi (runtime probe)
    NVIDIA-->>Script: success / failure

    alt GPU available
        Script->>Script: set HAS_GPU=true
        Script->>Docker: ensure nvidia default-runtime present / install NVIDIA Toolkit
        Script->>Script: install/start vLLM (GPU path)
    else CPU-only
        Script->>Docker: check /etc/docker/daemon.json
        Docker-->>Script: returns default-runtime
        Script->>Docker: remove nvidia default-runtime if present
        Script->>Docker: restart
    end

    Script->>CloudInit: wait (if cloud-init present)
    CloudInit-->>Script: ready

    Script->>Apt: stop apt timers & unattended-upgrades
    Script->>Apt: kill lingering apt/dpkg processes
    Script->>Apt: clear lock files & run dpkg --configure -a
    loop Poll for idle window
        Script->>Apt: lsof checks
        Apt-->>Script: active / idle
    end

    Script->>Script: perform installs (Node/Docker) with retries
    Script-->>Test: setup complete

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰
I nibble locks and probe the sky,
If GPUs sleep, I let them lie.
I hush apt's drums, then start the song,
Retry the dance when things go wrong.
Tests boot neat — the build hops strong.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main objective of the PR: hardening Brev E2E infrastructure to work reliably on CPU-only instances, which is the core focus across all three modified files.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/brev-e2e.test.js`:
- Around line 137-143: The execSync invocation interpolates BREv_INSTANCE_TYPE
and INSTANCE_NAME into a shell string which risks shell injection or breaking on
spaces; replace the execSync call in the test (the line using execSync in
test/e2e/brev-e2e.test.js) with a safer call using child_process.execFileSync
(or spawnSync) passing the command ("brev") and args as an array [ "create",
"--type", BREV_INSTANCE_TYPE, INSTANCE_NAME, "--detached" ] and keep the same
options object (encoding, timeout, stdio), or alternatively ensure the
interpolated variables are properly quoted/escaped before passing to the shell;
target the execSync invocation that logs with elapsed() and change it
accordingly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e23869cf-0bc6-41a7-86b4-3f2f112a1dc1

📥 Commits

Reviewing files that changed from the base of the PR and between e231d32 and d92dc54.

📒 Files selected for processing (3)

scripts/brev-setup.sh
scripts/setup.sh
test/e2e/brev-e2e.test.js

test/e2e/brev-e2e.test.js

Replaces execSync with execFileSync for the brev create call, passing args as an array instead of interpolating into a shell string. Avoids shell injection risk if BREV_INSTANCE_TYPE or INSTANCE_NAME ever contain spaces or metacharacters. Addresses CodeRabbit review feedback on PR NVIDIA#1266.

wscurran · 2026-04-01T20:56:50Z

✨ Thanks for submitting this pull request, which proposes a way to harden Brev E2E infrastructure for CPU-only instances.

Possibly related open issues:

#1179 [NemoClaw][Brev Launchable] Openclaw UI only shows the most recently onboarded sandbox configuration when two sandboxes exist
#1107 refactor(test): wire E2E security tests to real production code paths
#1171 [NemoClaw][MacOS][Brev] [tag v0.0.1] Base URL example is misleading as per other OpenAI-compatible endpoint

@zyang-dev

…ty (#1470) ## Summary - unify installer and onboarding host detection around shared TypeScript preflight logic - move `deploy` behavior into TypeScript, thin the Brev compatibility wrapper, and harden Brev readiness handling - demote or remove legacy platform-specific setup paths (`setup-spark`, `brev-setup.sh`) in favor of the canonical installer + onboard flow - update docs, CLI help, and Brev E2E coverage to match the new behavior ## What Changed - added shared host assessment and remediation planning in `src/lib/preflight.ts` - wired installer and onboard flows to the same host preflight decisions - changed Podman handling from hard block to unsupported-runtime warning - migrated deploy logic into `src/lib/deploy.ts` - updated `nemoclaw deploy` to use the authenticated Brev CLI, current Brev create flags, explicit GCP provider default, stricter readiness checks, and standard installer/onboard flow - removed `scripts/setup-spark.sh` and reduced `scripts/brev-setup.sh` to a deprecated compatibility wrapper - updated README/docs/help text and hardened the Brev E2E cleanup path ## Validation - `npm run build:cli` - targeted Vitest coverage for `src/lib/preflight.test.ts`, `src/lib/deploy.test.ts`, `test/install-preflight.test.js`, `test/cli.test.js`, `test/runner.test.js` - live Brev validation with `TEST_SUITE=deploy-cli` on `cpu-e2.4vcpu-16gb` - confirmed successful end-to-end remote deploy after waiting for Brev `status=RUNNING`, `build_status=COMPLETED`, `shell_status=READY` ## Related Issues - Fixes #1377 - Addresses #1330 - Addresses #1390 - Related to #1404 ## Credit / Prior Work This branch builds on ideas and prior work from: - #1368 by @zyang-dev for simplifying Spark setup and removing the old cgroup workaround - #1395 and #1468 by @kjw3 for the thin installer/bootstrap direction and installer path reliability - #1450 by @cjagwani for switching Brev flows toward GCP for reliability - #1383 by @13ernkastel for the current Brev create flag compatibility work - #1364 by @WuKongAI-CMU for deploy sync-path fixes - #1362 and #1266 by @jyaunches for the Brev E2E/launchable infrastructure direction - issue ideas from #1377 and #1404 by @zNeill, #1330 by @Marcelo5444, and #1390 by @ericksoa  ## Summary by CodeRabbit * **New Features** * Improved host diagnostics with actionable remediation guidance surfaced during installer/onboard preflight. * **Improvements** * macOS (Intel) now recommends Docker Desktop; DGX Spark guidance now uses the standard installer + `nemoclaw onboard`. * Preflight output shows detected runtime and WSL notes; installer prints remediation actions and will skip onboarding on blocking issues. * **Deprecations** * `nemoclaw deploy`, `nemoclaw setup-spark`, and the legacy bootstrap wrapper are now deprecated compatibility paths. * **Documentation** * Quickstart, troubleshooting, and command reference updated to reflect installer+onboard flow and deprecation guidance. * **Tests** * Added/updated tests covering preflight, deploy compatibility, CLI aliases, and deploy e2e scenarios.

jyaunches mentioned this pull request Apr 1, 2026

test: add real runAgentInSandbox() E2E tests for telegram injection #1267

Closed

3 tasks

coderabbitai bot reviewed Apr 1, 2026

View reviewed changes

test/e2e/brev-e2e.test.js Show resolved Hide resolved

kjw3 mentioned this pull request Apr 4, 2026

refactor(installer): unify host preflight and thin deploy compatibility #1470

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(e2e): harden Brev E2E infrastructure for CPU-only instances#1266

fix(e2e): harden Brev E2E infrastructure for CPU-only instances#1266
jyaunches wants to merge 2 commits intoNVIDIA:mainfrom
jyaunches:fix/brev-e2e-infra-hardening

jyaunches commented Apr 1, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Apr 1, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram

Estimated Code Review Effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

wscurran commented Apr 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jyaunches commented Apr 1, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

scripts/brev-setup.sh

scripts/setup.sh

test/e2e/brev-e2e.test.js

Testing

Type of Change

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated Code Review Effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

wscurran commented Apr 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jyaunches commented Apr 1, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Apr 1, 2026 •

edited

Loading