Skip to content

fix(onboard): increase gateway health-poll window and preserve state on first retry#1325

Closed
tommylin-signalpro wants to merge 4 commits intoNVIDIA:mainfrom
tommylin-signalpro:fix/gateway-health-poll-timeout
Closed

fix(onboard): increase gateway health-poll window and preserve state on first retry#1325
tommylin-signalpro wants to merge 4 commits intoNVIDIA:mainfrom
tommylin-signalpro:fix/gateway-health-poll-timeout

Conversation

@tommylin-signalpro
Copy link
Copy Markdown
Contributor

@tommylin-signalpro tommylin-signalpro commented Apr 2, 2026

Summary

  • Raise default gateway health-poll from 5 × 2 s = 10 s to 30 × 5 s ≈ 2.5 min (still overridable via NEMOCLAW_HEALTH_POLL_COUNT / NEMOCLAW_HEALTH_POLL_INTERVAL)
  • Skip destroyGateway() on the first retry to preserve cached images and k3s cluster state; only destroy on subsequent retries
  • Print a "Waiting for gateway to become healthy..." message during the poll so the user knows the process is not stuck

Refs: #446

Root Cause

On first-run or resource-constrained environments the embedded k3s inside the OpenShell gateway can take 30–90 s to become fully healthy after openshell gateway start returns. The current 10 s health-poll window expires, triggering a retry that calls destroyGateway() — which removes Docker volumes and wipes cached images. The next attempt is therefore a cold start that is even slower, creating a timeout loop that never converges.

The code comment on line 1925 notes that retrying "lets the second attempt benefit from cached images", but destroyGateway() deletes those cached images, contradicting the stated intent.

Changes

bin/lib/onboard.js

  • Default NEMOCLAW_HEALTH_POLL_COUNT: 530
  • Default NEMOCLAW_HEALTH_POLL_INTERVAL: 25
  • onFailedAttempt: only call destroyGateway() when attemptNumber > 1
  • Print status message on first poll iteration

Testing

  • Code-level analysis of the retry path; the env var override mechanism is unchanged
  • Not verified on a natively supported platform (Linux / Apple Silicon) — tested through a Docker-wrapped openshell on macOS Intel + Colima where the gateway consistently needed > 10 s to become healthy
  • Existing test/gateway-cleanup.test.js assertions still hold (destroyGateway() is still present in the function block, just conditionally gated)

Test plan

  • Verify on Linux (native Docker) that the happy path is unaffected (gateway starts in < 10 s, no extra wait)
  • Verify on a slow first-run environment that the extended poll window avoids the destroy-rebuild loop
  • Confirm NEMOCLAW_HEALTH_POLL_COUNT=5 NEMOCLAW_HEALTH_POLL_INTERVAL=2 restores the old behavior

Signed-off-by: Tommy Lin tommylin@signalpro.com.tw

Summary by CodeRabbit

  • New Features

    • Shows a one-time status message while waiting for the gateway to become healthy.
    • Unified interactive policy preset flow with per-preset retry (up to 3 attempts) for transient failures.
    • Clearer, explicit error reporting when applying policy presets fails.
  • Bug Fixes

    • Increased readiness and startup polling windows for more robust startup on slow environments.
    • Improved startup recovery to preserve gateway state on the first retry attempt.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 2, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4b2c3fed-5296-48e5-9243-07117506a2ab

📥 Commits

Reviewing files that changed from the base of the PR and between 13de2a9 and 8ee9abd.

📒 Files selected for processing (2)
  • bin/lib/onboard.js
  • bin/lib/policies.js
🚧 Files skipped from review as they are similar to previous changes (1)
  • bin/lib/policies.js

📝 Walkthrough

Walkthrough

Onboarding script extends polling/retry windows and tweaks retry cleanup; unified policy preset application with per-preset retries for transient sandbox errors; policy command execution now captures exit status and stderr to throw clearer errors.

Changes

Cohort / File(s) Summary
Onboard script (gateway, sandbox, policies)
bin/lib/onboard.js
Increased gateway health poll defaults (NEMOCLAW_HEALTH_POLL_COUNT 5→30, NEMOCLAW_HEALTH_POLL_INTERVAL 2→5) and sandbox readiness polling (attempts 10→15, delaySeconds 2→4). Added a one-time "Waiting for gateway to become healthy..." log on first non-terminal poll. Adjusted retry cleanup to skip destroyGateway() on first retry (err.attemptNumber > 1). Refactored _setupPolicies to compute a single presetsToApply array and apply each preset with up to 3 attempts, retrying on "sandbox not found" but immediately throwing on Unimplemented or other errors.
Policies command error handling
bin/lib/policies.js
applyPreset() now calls the policy-set command with { ignoreError: true }, inspects result.status, and throws an Error containing stderr (or exit code <status> fallback) when status is non-zero, instead of relying on run to throw.

Sequence Diagram(s)

sequenceDiagram
  participant CLI as "Onboard CLI"
  participant Gateway as "Gateway (nemoClaw/k3s)"
  participant Policies as "Policies Service"
  participant Sandbox as "Sandbox"

  CLI->>Gateway: startGatewayWithOptions()
  Gateway-->>CLI: started
  CLI->>Gateway: healthPoll (repeat up to NEMOCLAW_HEALTH_POLL_COUNT)
  alt first poll delayed
    Note over CLI,Gateway: Console log "Waiting for gateway to become healthy..."
  end
  Gateway-->>CLI: healthy
  CLI->>Policies: applyPreset(preset) (run with ignoreError)
  Policies-->>CLI: error "sandbox not found"
  CLI->>Policies: retry applyPreset (up to 3 attempts)
  Policies-->>CLI: preset applied
  CLI->>Sandbox: waitForSandboxReady(sandboxName) (poll attempts)
  Sandbox-->>CLI: ready
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I nibbled timeouts, stretched each wait,

I kept the gateway whole for its first fate,
Three gentle tries for presets to land,
A quiet log to give a hand,
Now sandboxes hum across the land.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the two main changes: increased gateway health-poll window and preservation of state on first retry, matching the core objectives of the PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 3127-3153: The retry-on-"sandbox not found" is broken because
policies.applyPreset(sandboxName, name) internally calls
run(buildPolicySetCommand(...)) which calls process.exit on failure so the
try/catch never sees the command failure; change policies.applyPreset (or its
caller) to run the policy-set command with ignoreError: true (or otherwise
invoke the underlying command without process.exit) so you can inspect the
returned status and implement the retry loop: update policies.applyPreset or add
a new method that returns a result object (success/failure and message) instead
of exiting, then in the loop around policies.applyPreset(sandboxName, name)
check the result for "sandbox not found" and retry up to 3 times with sleep(2),
preserving the existing Unimplemented handling.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3567791d-5c2f-42f0-a73b-ef1082ba5819

📥 Commits

Reviewing files that changed from the base of the PR and between eed6d02 and 0094bb6.

📒 Files selected for processing (1)
  • bin/lib/onboard.js

…on first retry

The default health-poll window (5 × 2 s = 10 s) is too short for
resource-constrained environments (Colima on Intel Mac, low-RAM VMs,
first-run cold caches).  When the poll expires the retry logic destroys
the gateway — wiping cached images and k3s state — which makes the
next attempt even slower, creating a timeout loop that never converges.

Changes:
- Raise default poll to 30 × 5 s (≈ 2.5 min), still overridable via
  NEMOCLAW_HEALTH_POLL_COUNT / NEMOCLAW_HEALTH_POLL_INTERVAL env vars.
- Skip destroyGateway() on the first retry so cached images and
  cluster state are preserved; only destroy on subsequent retries.
- Print a "Waiting for gateway..." message so the user knows the
  process is not stuck.

Refs: NVIDIA#446

Signed-off-by: Tommy Lin <tommylin@signalpro.com.tw>
…policy apply

waitForSandboxReady defaults were 10 × 2s = 20s which is too short on
Colima/slow environments. Increase to 15 × 4s = 60s.

The interactive policy-apply paths (both "Y" for suggested and "list"
for user-picked presets) had no retry on transient "sandbox not found"
errors, unlike the non-interactive path which retries 3 times. Unify
both interactive branches into a single loop with the same retry logic.

Signed-off-by: Tommy Lin <tommylin@signalpro.com.tw>
applyPreset() called run() without ignoreError, causing process.exit()
on command failure. This made the try/catch retry logic in onboard
_setupPolicies unreachable — the process would terminate before the
catch block could handle transient "sandbox not found" errors.

Change to run() with ignoreError: true and throw an Error with the
stderr message, so the existing retry loops in both interactive and
non-interactive onboard paths can actually catch and retry.

Signed-off-by: Tommy Lin <tommylin@signalpro.com.tw>
@tommylin-signalpro tommylin-signalpro force-pushed the fix/gateway-health-poll-timeout branch from 13de2a9 to 8ee9abd Compare April 3, 2026 02:30
@wscurran wscurran added bug Something isn't working NemoClaw CLI Use this label to identify issues with the NemoClaw command-line interface (CLI). labels Apr 3, 2026
@wscurran
Copy link
Copy Markdown
Contributor

wscurran commented Apr 3, 2026

✨ Thanks for submitting this pull request, which proposes a way to improve the onboarding experience by increasing the gateway health-poll window and preserving k3s cluster state on first retry.

@tommylin-signalpro
Copy link
Copy Markdown
Contributor Author

Closing — the core timeout issue was already addressed by #1051 (exponential backoff retry, merged 3/30). The remaining incremental changes (larger poll window, skip destroy on first retry) overlap with the design decisions in #1051, and the _setupPolicies/policies.js changes have been superseded by #1195 and #1463.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working NemoClaw CLI Use this label to identify issues with the NemoClaw command-line interface (CLI).

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants