Skip to content

fix(e2e): harden Brev E2E infrastructure for CPU-only instances#1266

Open
jyaunches wants to merge 2 commits intoNVIDIA:mainfrom
jyaunches:fix/brev-e2e-infra-hardening
Open

fix(e2e): harden Brev E2E infrastructure for CPU-only instances#1266
jyaunches wants to merge 2 commits intoNVIDIA:mainfrom
jyaunches:fix/brev-e2e-infra-hardening

Conversation

@jyaunches
Copy link
Copy Markdown
Contributor

@jyaunches jyaunches commented Apr 1, 2026

Summary

Shared E2E infrastructure improvements that make Brev-based CI reliable on CPU-only instances. Split from #1219 to keep concerns separate.

Changes

scripts/brev-setup.sh

  • Extract HAS_GPU flag for consistent GPU detection — nvidia-smi must run successfully, not just exist on PATH (Brev GPU images ship the binary even on CPU instances)
  • All GPU gates (container toolkit, Docker runtime reset, vLLM) use the single HAS_GPU flag
  • Replace cloud-init || true with proper error check + warning (no longer silently swallows failures)
  • Add timeout (300s) and quiet window (5s) to apt wait loop to prevent indefinite hangs and races where apt briefly pauses between operations
  • Add retry loops (5 attempts, 30s backoff) for Node.js and Docker apt-get installs
  • Unsuppress NodeSource installer output for debugging
  • Reset Docker default runtime to runc on CPU-only instances

scripts/setup.sh

  • Check nvidia-smi exit code instead of just PATH presence for GPU gateway flag

test/e2e/brev-e2e.test.js

  • Use known GCP instance type (n2-standard-4) instead of cpu search
  • Add shellQuote() for safe secret interpolation in SSH commands
  • Delete leftover instances before creating new ones
  • Wait for cloud-init before running bootstrap

Testing

  • bash -n scripts/brev-setup.sh — syntax check passes
  • bash -n scripts/setup.sh — syntax check passes
  • npm test — CLI tests pass
  • shellcheck passes (via pre-commit)

Type of Change

  • Code change for a new feature, bug fix, or refactor.
  • Code change with doc updates.
  • Doc only.

Summary by CodeRabbit

  • Bug Fixes

    • More reliable GPU detection by probing runtime availability instead of binary presence
    • Improved apt access handling with automatic lock clearing, background timer suppression, and wait/retry for idle state
    • Automatic Docker runtime fallback to CPU when no GPU is available
  • Chores

    • Centralized GPU flag to unify GPU-gated behavior
    • Retry logic with backoff for key installers and more visible installer output
  • Tests

    • More robust e2e provisioning and VM readiness checks, plus safer remote quoting

Shared E2E infrastructure improvements that make Brev-based CI reliable
on CPU-only instances:

brev-setup.sh:
- Extract HAS_GPU flag for consistent GPU detection — nvidia-smi must
  run successfully, not just exist on PATH (Brev GPU images ship the
  binary even on CPU instances)
- All GPU gates (container toolkit, Docker runtime reset, vLLM) use
  the single HAS_GPU flag
- Replace cloud-init || true with proper error check + warning
- Add timeout (300s) and quiet window (5s) to apt wait loop to prevent
  indefinite hangs and races
- Add retry loops (5 attempts, 30s backoff) for Node.js and Docker
  apt-get installs
- Unsuppress NodeSource installer output for debugging
- Reset Docker default runtime to runc on CPU-only instances where
  Brev pre-configures nvidia as default

setup.sh:
- Check nvidia-smi exit code instead of just PATH presence for GPU
  gateway flag

brev-e2e.test.js:
- Use known GCP instance type (n2-standard-4) instead of cpu search
- Add shellQuote() for safe secret interpolation in SSH commands
- Delete leftover instances before creating new ones
- Wait for cloud-init before running bootstrap
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 1, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1691e813-e57c-4538-ba7a-1b034ce6672f

📥 Commits

Reviewing files that changed from the base of the PR and between d92dc54 and dafcaba.

📒 Files selected for processing (1)
  • test/e2e/brev-e2e.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/brev-e2e.test.js

📝 Walkthrough

Walkthrough

The changes centralize GPU detection via a runtime probe, add an exclusive apt-access sequence to avoid package manager conflicts, add retry/backoff and logging for Node.js and Docker installs, adjust Docker daemon for CPU-only systems, and change E2E test instance provisioning to use a direct instance type with improved env quoting and cloud-init waits.

Changes

Cohort / File(s) Summary
GPU Detection & Runtime Configuration
scripts/brev-setup.sh, scripts/setup.sh
Centralized GPU detection using a runtime nvidia-smi probe. Gate NVIDIA Container Toolkit and vLLM install/start with HAS_GPU. setup.sh now appends --gpu to gateway args only when nvidia-smi runs successfully.
Apt Package Manager Lock Management
scripts/brev-setup.sh
Introduced exclusive apt access: wait for cloud-init (if present), stop/disable apt timers and unattended-upgrades, kill lingering apt/dpkg processes, clear lock files, run dpkg --configure -a, then poll until apt/dpkg are idle with a max timeout.
Installation Robustness & Logging
scripts/brev-setup.sh
NodeSource (Node.js) installer and Docker install now retry up to 5 times with 30s backoff; NodeSource installer output is no longer silenced for improved logging.
CPU-only Docker Runtime Handling
scripts/brev-setup.sh
When no GPU detected, check /etc/docker/daemon.json for "default-runtime":"nvidia", remove that setting if present, and restart Docker so containers default to runc.
E2E Test Provisioning & Env Handling
test/e2e/brev-e2e.test.js
Replaced BREV_MIN_VCPU/BREV_MIN_RAM with BREV_INSTANCE_TYPE (default n2-standard-4) and create instances with --type. Replaced shellEscape() with shellQuote() for env quoting. Added best-effort pre-test instance cleanup and cloud-init status --wait after sync before running brev-setup.sh.

Sequence Diagram

sequenceDiagram
    participant Test as E2E Test
    participant Script as brev-setup.sh
    participant CloudInit as cloud-init
    participant Apt as apt/dpkg System
    participant Docker as Docker Daemon
    participant NVIDIA as nvidia-smi

    Test->>Script: trigger setup
    Script->>NVIDIA: run nvidia-smi (runtime probe)
    NVIDIA-->>Script: success / failure

    alt GPU available
        Script->>Script: set HAS_GPU=true
        Script->>Docker: ensure nvidia default-runtime present / install NVIDIA Toolkit
        Script->>Script: install/start vLLM (GPU path)
    else CPU-only
        Script->>Docker: check /etc/docker/daemon.json
        Docker-->>Script: returns default-runtime
        Script->>Docker: remove nvidia default-runtime if present
        Script->>Docker: restart
    end

    Script->>CloudInit: wait (if cloud-init present)
    CloudInit-->>Script: ready

    Script->>Apt: stop apt timers & unattended-upgrades
    Script->>Apt: kill lingering apt/dpkg processes
    Script->>Apt: clear lock files & run dpkg --configure -a
    loop Poll for idle window
        Script->>Apt: lsof checks
        Apt-->>Script: active / idle
    end

    Script->>Script: perform installs (Node/Docker) with retries
    Script-->>Test: setup complete
Loading

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰
I nibble locks and probe the sky,
If GPUs sleep, I let them lie.
I hush apt's drums, then start the song,
Retry the dance when things go wrong.
Tests boot neat — the build hops strong.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main objective of the PR: hardening Brev E2E infrastructure to work reliably on CPU-only instances, which is the core focus across all three modified files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/brev-e2e.test.js`:
- Around line 137-143: The execSync invocation interpolates BREv_INSTANCE_TYPE
and INSTANCE_NAME into a shell string which risks shell injection or breaking on
spaces; replace the execSync call in the test (the line using execSync in
test/e2e/brev-e2e.test.js) with a safer call using child_process.execFileSync
(or spawnSync) passing the command ("brev") and args as an array [ "create",
"--type", BREV_INSTANCE_TYPE, INSTANCE_NAME, "--detached" ] and keep the same
options object (encoding, timeout, stdio), or alternatively ensure the
interpolated variables are properly quoted/escaped before passing to the shell;
target the execSync invocation that logs with elapsed() and change it
accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e23869cf-0bc6-41a7-86b4-3f2f112a1dc1

📥 Commits

Reviewing files that changed from the base of the PR and between e231d32 and d92dc54.

📒 Files selected for processing (3)
  • scripts/brev-setup.sh
  • scripts/setup.sh
  • test/e2e/brev-e2e.test.js

Replaces execSync with execFileSync for the brev create call, passing
args as an array instead of interpolating into a shell string. Avoids
shell injection risk if BREV_INSTANCE_TYPE or INSTANCE_NAME ever
contain spaces or metacharacters.

Addresses CodeRabbit review feedback on PR NVIDIA#1266.
@wscurran wscurran added status: triage For new items that haven't been reviewed yet. bug Something isn't working good first issue Good for newcomers Platform: Brev Support for Brev deployment CI/CD Use this label to identify issues with NemoClaw CI/CD pipeline or GitHub Actions. and removed status: triage For new items that haven't been reviewed yet. labels Apr 1, 2026
@wscurran
Copy link
Copy Markdown
Contributor

wscurran commented Apr 1, 2026

ericksoa pushed a commit that referenced this pull request Apr 4, 2026
…ty (#1470)

## Summary
- unify installer and onboarding host detection around shared TypeScript
preflight logic
- move `deploy` behavior into TypeScript, thin the Brev compatibility
wrapper, and harden Brev readiness handling
- demote or remove legacy platform-specific setup paths (`setup-spark`,
`brev-setup.sh`) in favor of the canonical installer + onboard flow
- update docs, CLI help, and Brev E2E coverage to match the new behavior

## What Changed
- added shared host assessment and remediation planning in
`src/lib/preflight.ts`
- wired installer and onboard flows to the same host preflight decisions
- changed Podman handling from hard block to unsupported-runtime warning
- migrated deploy logic into `src/lib/deploy.ts`
- updated `nemoclaw deploy` to use the authenticated Brev CLI, current
Brev create flags, explicit GCP provider default, stricter readiness
checks, and standard installer/onboard flow
- removed `scripts/setup-spark.sh` and reduced `scripts/brev-setup.sh`
to a deprecated compatibility wrapper
- updated README/docs/help text and hardened the Brev E2E cleanup path

## Validation
- `npm run build:cli`
- targeted Vitest coverage for `src/lib/preflight.test.ts`,
`src/lib/deploy.test.ts`, `test/install-preflight.test.js`,
`test/cli.test.js`, `test/runner.test.js`
- live Brev validation with `TEST_SUITE=deploy-cli` on
`cpu-e2.4vcpu-16gb`
- confirmed successful end-to-end remote deploy after waiting for Brev
`status=RUNNING`, `build_status=COMPLETED`, `shell_status=READY`

## Related Issues
- Fixes #1377
- Addresses #1330
- Addresses #1390
- Related to #1404

## Credit / Prior Work
This branch builds on ideas and prior work from:
- #1368 by @zyang-dev for simplifying Spark setup and removing the old
cgroup workaround
- #1395 and #1468 by @kjw3 for the thin installer/bootstrap direction
and installer path reliability
- #1450 by @cjagwani for switching Brev flows toward GCP for reliability
- #1383 by @13ernkastel for the current Brev create flag compatibility
work
- #1364 by @WuKongAI-CMU for deploy sync-path fixes
- #1362 and #1266 by @jyaunches for the Brev E2E/launchable
infrastructure direction
- issue ideas from #1377 and #1404 by @zNeill, #1330 by @Marcelo5444,
and #1390 by @ericksoa


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Improved host diagnostics with actionable remediation guidance
surfaced during installer/onboard preflight.

* **Improvements**
* macOS (Intel) now recommends Docker Desktop; DGX Spark guidance now
uses the standard installer + `nemoclaw onboard`.
* Preflight output shows detected runtime and WSL notes; installer
prints remediation actions and will skip onboarding on blocking issues.

* **Deprecations**
* `nemoclaw deploy`, `nemoclaw setup-spark`, and the legacy bootstrap
wrapper are now deprecated compatibility paths.

* **Documentation**
* Quickstart, troubleshooting, and command reference updated to reflect
installer+onboard flow and deprecation guidance.

* **Tests**
* Added/updated tests covering preflight, deploy compatibility, CLI
aliases, and deploy e2e scenarios.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working CI/CD Use this label to identify issues with NemoClaw CI/CD pipeline or GitHub Actions. good first issue Good for newcomers Platform: Brev Support for Brev deployment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants