fix(onboard): increase gateway health-poll window and preserve state on first retry by tommylin-signalpro · Pull Request #1325 · NVIDIA/NemoClaw

tommylin-signalpro · 2026-04-02T13:55:21Z

Summary

Raise default gateway health-poll from 5 × 2 s = 10 s to 30 × 5 s ≈ 2.5 min (still overridable via NEMOCLAW_HEALTH_POLL_COUNT / NEMOCLAW_HEALTH_POLL_INTERVAL)
Skip destroyGateway() on the first retry to preserve cached images and k3s cluster state; only destroy on subsequent retries
Print a "Waiting for gateway to become healthy..." message during the poll so the user knows the process is not stuck

Refs: #446

Root Cause

On first-run or resource-constrained environments the embedded k3s inside the OpenShell gateway can take 30–90 s to become fully healthy after openshell gateway start returns. The current 10 s health-poll window expires, triggering a retry that calls destroyGateway() — which removes Docker volumes and wipes cached images. The next attempt is therefore a cold start that is even slower, creating a timeout loop that never converges.

The code comment on line 1925 notes that retrying "lets the second attempt benefit from cached images", but destroyGateway() deletes those cached images, contradicting the stated intent.

Changes

bin/lib/onboard.js

Default NEMOCLAW_HEALTH_POLL_COUNT: 5 → 30
Default NEMOCLAW_HEALTH_POLL_INTERVAL: 2 → 5
onFailedAttempt: only call destroyGateway() when attemptNumber > 1
Print status message on first poll iteration

Testing

Code-level analysis of the retry path; the env var override mechanism is unchanged
Not verified on a natively supported platform (Linux / Apple Silicon) — tested through a Docker-wrapped openshell on macOS Intel + Colima where the gateway consistently needed > 10 s to become healthy
Existing test/gateway-cleanup.test.js assertions still hold (destroyGateway() is still present in the function block, just conditionally gated)

Test plan

Verify on Linux (native Docker) that the happy path is unaffected (gateway starts in < 10 s, no extra wait)
Verify on a slow first-run environment that the extended poll window avoids the destroy-rebuild loop
Confirm NEMOCLAW_HEALTH_POLL_COUNT=5 NEMOCLAW_HEALTH_POLL_INTERVAL=2 restores the old behavior

Signed-off-by: Tommy Lin tommylin@signalpro.com.tw

Summary by CodeRabbit

New Features
- Shows a one-time status message while waiting for the gateway to become healthy.
- Unified interactive policy preset flow with per-preset retry (up to 3 attempts) for transient failures.
- Clearer, explicit error reporting when applying policy presets fails.
Bug Fixes
- Increased readiness and startup polling windows for more robust startup on slow environments.
- Improved startup recovery to preserve gateway state on the first retry attempt.

coderabbitai · 2026-04-02T13:55:37Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4b2c3fed-5296-48e5-9243-07117506a2ab

📥 Commits

Reviewing files that changed from the base of the PR and between 13de2a9 and 8ee9abd.

📒 Files selected for processing (2)

bin/lib/onboard.js
bin/lib/policies.js

🚧 Files skipped from review as they are similar to previous changes (1)

bin/lib/policies.js

📝 Walkthrough

Walkthrough

Onboarding script extends polling/retry windows and tweaks retry cleanup; unified policy preset application with per-preset retries for transient sandbox errors; policy command execution now captures exit status and stderr to throw clearer errors.

Changes

Cohort / File(s)	Summary
Onboard script (gateway, sandbox, policies) `bin/lib/onboard.js`	Increased gateway health poll defaults (`NEMOCLAW_HEALTH_POLL_COUNT` 5→30, `NEMOCLAW_HEALTH_POLL_INTERVAL` 2→5) and sandbox readiness polling (`attempts` 10→15, `delaySeconds` 2→4). Added a one-time "Waiting for gateway to become healthy..." log on first non-terminal poll. Adjusted retry cleanup to skip `destroyGateway()` on first retry (`err.attemptNumber > 1`). Refactored `_setupPolicies` to compute a single `presetsToApply` array and apply each preset with up to 3 attempts, retrying on "sandbox not found" but immediately throwing on `Unimplemented` or other errors.
Policies command error handling `bin/lib/policies.js`	`applyPreset()` now calls the policy-set command with `{ ignoreError: true }`, inspects `result.status`, and throws an Error containing `stderr` (or `exit code <status>` fallback) when `status` is non-zero, instead of relying on `run` to throw.

Sequence Diagram(s)

sequenceDiagram
  participant CLI as "Onboard CLI"
  participant Gateway as "Gateway (nemoClaw/k3s)"
  participant Policies as "Policies Service"
  participant Sandbox as "Sandbox"

  CLI->>Gateway: startGatewayWithOptions()
  Gateway-->>CLI: started
  CLI->>Gateway: healthPoll (repeat up to NEMOCLAW_HEALTH_POLL_COUNT)
  alt first poll delayed
    Note over CLI,Gateway: Console log "Waiting for gateway to become healthy..."
  end
  Gateway-->>CLI: healthy
  CLI->>Policies: applyPreset(preset) (run with ignoreError)
  Policies-->>CLI: error "sandbox not found"
  CLI->>Policies: retry applyPreset (up to 3 attempts)
  Policies-->>CLI: preset applied
  CLI->>Sandbox: waitForSandboxReady(sandboxName) (poll attempts)
  Sandbox-->>CLI: ready

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I nibbled timeouts, stretched each wait,

I kept the gateway whole for its first fate,
Three gentle tries for presets to land,
A quiet log to give a hand,
Now sandboxes hum across the land.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the two main changes: increased gateway health-poll window and preservation of state on first retry, matching the core objectives of the PR.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 3127-3153: The retry-on-"sandbox not found" is broken because
policies.applyPreset(sandboxName, name) internally calls
run(buildPolicySetCommand(...)) which calls process.exit on failure so the
try/catch never sees the command failure; change policies.applyPreset (or its
caller) to run the policy-set command with ignoreError: true (or otherwise
invoke the underlying command without process.exit) so you can inspect the
returned status and implement the retry loop: update policies.applyPreset or add
a new method that returns a result object (success/failure and message) instead
of exiting, then in the loop around policies.applyPreset(sandboxName, name)
check the result for "sandbox not found" and retry up to 3 times with sleep(2),
preserving the existing Unimplemented handling.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3567791d-5c2f-42f0-a73b-ef1082ba5819

📥 Commits

Reviewing files that changed from the base of the PR and between eed6d02 and 0094bb6.

📒 Files selected for processing (1)

bin/lib/onboard.js

bin/lib/onboard.js

…on first retry The default health-poll window (5 × 2 s = 10 s) is too short for resource-constrained environments (Colima on Intel Mac, low-RAM VMs, first-run cold caches). When the poll expires the retry logic destroys the gateway — wiping cached images and k3s state — which makes the next attempt even slower, creating a timeout loop that never converges. Changes: - Raise default poll to 30 × 5 s (≈ 2.5 min), still overridable via NEMOCLAW_HEALTH_POLL_COUNT / NEMOCLAW_HEALTH_POLL_INTERVAL env vars. - Skip destroyGateway() on the first retry so cached images and cluster state are preserved; only destroy on subsequent retries. - Print a "Waiting for gateway..." message so the user knows the process is not stuck. Refs: NVIDIA#446 Signed-off-by: Tommy Lin <tommylin@signalpro.com.tw>

…policy apply waitForSandboxReady defaults were 10 × 2s = 20s which is too short on Colima/slow environments. Increase to 15 × 4s = 60s. The interactive policy-apply paths (both "Y" for suggested and "list" for user-picked presets) had no retry on transient "sandbox not found" errors, unlike the non-interactive path which retries 3 times. Unify both interactive branches into a single loop with the same retry logic. Signed-off-by: Tommy Lin <tommylin@signalpro.com.tw>

applyPreset() called run() without ignoreError, causing process.exit() on command failure. This made the try/catch retry logic in onboard _setupPolicies unreachable — the process would terminate before the catch block could handle transient "sandbox not found" errors. Change to run() with ignoreError: true and throw an Error with the stderr message, so the existing retry loops in both interactive and non-interactive onboard paths can actually catch and retry. Signed-off-by: Tommy Lin <tommylin@signalpro.com.tw>

wscurran · 2026-04-03T04:42:54Z

✨ Thanks for submitting this pull request, which proposes a way to improve the onboarding experience by increasing the gateway health-poll window and preserving k3s cluster state on first retry.

tommylin-signalpro · 2026-04-07T10:09:25Z

Closing — the core timeout issue was already addressed by #1051 (exponential backoff retry, merged 3/30). The remaining incremental changes (larger poll window, skip destroy on first retry) overlap with the design decisions in #1051, and the _setupPolicies/policies.js changes have been superseded by #1195 and #1463.

coderabbitai bot reviewed Apr 2, 2026

View reviewed changes

bin/lib/onboard.js Show resolved Hide resolved

tommylin-signalpro added 3 commits April 3, 2026 10:27

tommylin-signalpro force-pushed the fix/gateway-health-poll-timeout branch from 13de2a9 to 8ee9abd Compare April 3, 2026 02:30

Merge branch 'main' into fix/gateway-health-poll-timeout

3adfbe4

wscurran added bug Something isn't working NemoClaw CLI Use this label to identify issues with the NemoClaw command-line interface (CLI). labels Apr 3, 2026

tommylin-signalpro closed this Apr 7, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(onboard): increase gateway health-poll window and preserve state on first retry#1325

fix(onboard): increase gateway health-poll window and preserve state on first retry#1325
tommylin-signalpro wants to merge 4 commits intoNVIDIA:mainfrom
tommylin-signalpro:fix/gateway-health-poll-timeout

tommylin-signalpro commented Apr 2, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Apr 2, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

wscurran commented Apr 3, 2026

Uh oh!

tommylin-signalpro commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

tommylin-signalpro commented Apr 2, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root Cause

Changes

Testing

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

wscurran commented Apr 3, 2026

Uh oh!

tommylin-signalpro commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tommylin-signalpro commented Apr 2, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Apr 2, 2026 •

edited

Loading