fix(onboard): increase gateway health-poll window and preserve state on first retry#1325
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughOnboarding script extends polling/retry windows and tweaks retry cleanup; unified policy preset application with per-preset retries for transient sandbox errors; policy command execution now captures exit status and stderr to throw clearer errors. Changes
Sequence Diagram(s)sequenceDiagram
participant CLI as "Onboard CLI"
participant Gateway as "Gateway (nemoClaw/k3s)"
participant Policies as "Policies Service"
participant Sandbox as "Sandbox"
CLI->>Gateway: startGatewayWithOptions()
Gateway-->>CLI: started
CLI->>Gateway: healthPoll (repeat up to NEMOCLAW_HEALTH_POLL_COUNT)
alt first poll delayed
Note over CLI,Gateway: Console log "Waiting for gateway to become healthy..."
end
Gateway-->>CLI: healthy
CLI->>Policies: applyPreset(preset) (run with ignoreError)
Policies-->>CLI: error "sandbox not found"
CLI->>Policies: retry applyPreset (up to 3 attempts)
Policies-->>CLI: preset applied
CLI->>Sandbox: waitForSandboxReady(sandboxName) (poll attempts)
Sandbox-->>CLI: ready
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/onboard.js`:
- Around line 3127-3153: The retry-on-"sandbox not found" is broken because
policies.applyPreset(sandboxName, name) internally calls
run(buildPolicySetCommand(...)) which calls process.exit on failure so the
try/catch never sees the command failure; change policies.applyPreset (or its
caller) to run the policy-set command with ignoreError: true (or otherwise
invoke the underlying command without process.exit) so you can inspect the
returned status and implement the retry loop: update policies.applyPreset or add
a new method that returns a result object (success/failure and message) instead
of exiting, then in the loop around policies.applyPreset(sandboxName, name)
check the result for "sandbox not found" and retry up to 3 times with sleep(2),
preserving the existing Unimplemented handling.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3567791d-5c2f-42f0-a73b-ef1082ba5819
📒 Files selected for processing (1)
bin/lib/onboard.js
…on first retry The default health-poll window (5 × 2 s = 10 s) is too short for resource-constrained environments (Colima on Intel Mac, low-RAM VMs, first-run cold caches). When the poll expires the retry logic destroys the gateway — wiping cached images and k3s state — which makes the next attempt even slower, creating a timeout loop that never converges. Changes: - Raise default poll to 30 × 5 s (≈ 2.5 min), still overridable via NEMOCLAW_HEALTH_POLL_COUNT / NEMOCLAW_HEALTH_POLL_INTERVAL env vars. - Skip destroyGateway() on the first retry so cached images and cluster state are preserved; only destroy on subsequent retries. - Print a "Waiting for gateway..." message so the user knows the process is not stuck. Refs: NVIDIA#446 Signed-off-by: Tommy Lin <tommylin@signalpro.com.tw>
…policy apply waitForSandboxReady defaults were 10 × 2s = 20s which is too short on Colima/slow environments. Increase to 15 × 4s = 60s. The interactive policy-apply paths (both "Y" for suggested and "list" for user-picked presets) had no retry on transient "sandbox not found" errors, unlike the non-interactive path which retries 3 times. Unify both interactive branches into a single loop with the same retry logic. Signed-off-by: Tommy Lin <tommylin@signalpro.com.tw>
applyPreset() called run() without ignoreError, causing process.exit() on command failure. This made the try/catch retry logic in onboard _setupPolicies unreachable — the process would terminate before the catch block could handle transient "sandbox not found" errors. Change to run() with ignoreError: true and throw an Error with the stderr message, so the existing retry loops in both interactive and non-interactive onboard paths can actually catch and retry. Signed-off-by: Tommy Lin <tommylin@signalpro.com.tw>
13de2a9 to
8ee9abd
Compare
|
✨ Thanks for submitting this pull request, which proposes a way to improve the onboarding experience by increasing the gateway health-poll window and preserving k3s cluster state on first retry. |
|
Closing — the core timeout issue was already addressed by #1051 (exponential backoff retry, merged 3/30). The remaining incremental changes (larger poll window, skip destroy on first retry) overlap with the design decisions in #1051, and the |
Summary
5 × 2 s = 10 sto30 × 5 s ≈ 2.5 min(still overridable viaNEMOCLAW_HEALTH_POLL_COUNT/NEMOCLAW_HEALTH_POLL_INTERVAL)destroyGateway()on the first retry to preserve cached images and k3s cluster state; only destroy on subsequent retriesRefs: #446
Root Cause
On first-run or resource-constrained environments the embedded k3s inside the OpenShell gateway can take 30–90 s to become fully healthy after
openshell gateway startreturns. The current 10 s health-poll window expires, triggering a retry that callsdestroyGateway()— which removes Docker volumes and wipes cached images. The next attempt is therefore a cold start that is even slower, creating a timeout loop that never converges.The code comment on line 1925 notes that retrying "lets the second attempt benefit from cached images", but
destroyGateway()deletes those cached images, contradicting the stated intent.Changes
bin/lib/onboard.jsNEMOCLAW_HEALTH_POLL_COUNT:5→30NEMOCLAW_HEALTH_POLL_INTERVAL:2→5onFailedAttempt: only calldestroyGateway()whenattemptNumber > 1Testing
test/gateway-cleanup.test.jsassertions still hold (destroyGateway()is still present in the function block, just conditionally gated)Test plan
NEMOCLAW_HEALTH_POLL_COUNT=5 NEMOCLAW_HEALTH_POLL_INTERVAL=2restores the old behaviorSigned-off-by: Tommy Lin tommylin@signalpro.com.tw
Summary by CodeRabbit
New Features
Bug Fixes