
test: add real runAgentInSandbox() E2E tests for telegram injection (Phases 7-11) #1219

Closed

jyaunches wants to merge 5 commits into NVIDIA:main from jyaunches:test/telegram-injection-only

Conversation


@jyaunches (Contributor) commented Apr 1, 2026

Summary

Adds Phases 7-11 to test/e2e/test-telegram-injection.sh addressing PR #1092 feedback from @cv:

test/e2e/test-telegram-injection.sh does not exercise scripts/telegram-bridge.js / runAgentInSandbox(). It uses ad-hoc SSH commands like MSG=$(cat) && echo ..., which bypass the actual quoting/session-handling path we need confidence in.

These tests exercise the real production code path by invoking bridge.runAgentInSandbox() and bridge.sanitizeSessionId() directly via Node.js.

New Test Phases

| Phase | Tests | What it validates |
| --- | --- | --- |
| Phase 7 | T9-T11 | Real runAgentInSandbox() with $(cmd), backtick payloads, API key not in process table |
| Phase 8 | T12-T13 | Multi-line message injection via real code path |
| Phase 9 | T14-T15 | sanitizeSessionId() strips leading hyphens (option injection prevention) |
| Phase 10 | T16-T17 | Empty/special-character session ID edge cases |
| Phase 11 | - | Cleanup |

Expected Behavior

These tests are expected to FAIL on current main because the security fix (scripts/telegram-bridge.js changes from #119) is not yet merged. Specifically:

  • T9-T13: PASS (ad-hoc SSH injection tests work on unfixed code too)
  • T14-T17: FAIL with TELEGRAM_BOT_TOKEN required because the unfixed bridge does not export sanitizeSessionId/initConfig

The follow-up PR with the actual fix will make all tests pass.
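As background for the Phase 9/10 checks, here is a minimal sketch of what a sanitizer like sanitizeSessionId() needs to do. This is an assumed illustration, not the actual scripts/telegram-bridge.js implementation: strip unsafe characters, then remove leading hyphens so a Telegram group chat ID such as -123456789 can never be parsed as a CLI option.

```javascript
// Hypothetical sketch of a session-id sanitizer (assumed names; the real
// implementation lives in scripts/telegram-bridge.js once the fix lands).
function sanitizeSessionId(raw) {
  // Keep only characters that are safe in a shell argument.
  const cleaned = String(raw).replace(/[^A-Za-z0-9_-]/g, '');
  // Strip leading hyphens so the value cannot be mistaken for an option.
  return cleaned.replace(/^-+/, '');
}

console.log(sanitizeSessionId('--help'));     // → "help"
console.log(sanitizeSessionId('-123456789')); // → "123456789"
```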

E2E Infrastructure Fixes

Also includes reliability fixes for Brev CPU-only E2E instances:

  • wait_for_apt() with retry loops for apt lock contention on boot
  • nvidia-smi exit code check (not just PATH presence) before enabling GPU
  • Reset Docker default runtime to runc on CPU-only instances
  • Delete leftover instances before creating new ones
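The retry helper described above can be sketched as follows, with an explicit timeout so bootstrap fails loudly instead of hanging. Function and variable names are assumed; the real helper lives in scripts/brev-setup.sh.

```shell
# Assumed sketch of wait_for_apt(): poll until no apt/dpkg process is
# running, giving up after max_attempts so the caller can exit non-zero.
wait_for_apt() {
  max_attempts="${1:-120}"
  attempt=0
  # procps pgrep treats the pattern as an extended regex, so plain "|"
  # alternation works; -x requires the whole process name to match.
  while pgrep -x "apt-get|apt|dpkg" >/dev/null 2>&1; do
    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "apt/dpkg still busy after ${max_attempts} attempts" >&2
      return 1
    fi
    echo "Waiting for apt/dpkg processes to finish... (${attempt}/${max_attempts})"
    sleep 5
  done
  return 0
}
```

A caller would then use `wait_for_apt 120 || exit 1` so provisioning stops instead of racing the lock.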

Validation

E2E run on fork demonstrating expected failures: https://github.com/jyaunches/NemoClaw/actions/runs/23827384671

Refs: #118, PR #1092

Summary by CodeRabbit

  • Improvements

    • More reliable GPU detection (checks functional GPU availability), waits for system provisioning before installs, shows installer output, and handles CPU-only hosts by restoring Docker runtime when needed.
  • CI

    • Workflow broadened to run for an additional repository.
  • Tests

    • Expanded end-to-end tests for injection resistance, process isolation, session-id sanitization, provisioning sync, and cleanup.

…box()

Add Phases 7-11 to test-telegram-injection.sh addressing PR NVIDIA#1092
feedback from @cv: exercise real production code paths instead of
ad-hoc SSH commands.

New phases:
- Phase 7: Real runAgentInSandbox() with injection payloads (T9-T11)
- Phase 8: Multi-line message handling (T12-T13)
- Phase 9: Session ID option injection prevention (T14-T15)
- Phase 10: Empty and edge case messages (T16-T17)
- Phase 11: Cleanup

These tests are expected to FAIL on unfixed code, proving they
detect the vulnerability.

Refs: NVIDIA#118, PR NVIDIA#1092
- Add wait_for_apt() with retry loops for apt lock contention
- Unsuppress NodeSource installer output for debugging
- Delete leftover instances before creating new ones
- Check nvidia-smi exit code (not just PATH presence) before
  enabling GPU in gateway start and container toolkit setup
- Reset Docker default runtime to runc on CPU-only instances
- Simplify env var passing to remote VM
@coderabbitai bot (Contributor) commented Apr 1, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Broadened CI trigger for the e2e job; added cloud-init synchronization before provisioning; tightened GPU detection to require a runnable nvidia-smi and added CPU-only Docker runtime fallback; removed output-suppression for installers; introduced large E2E test additions for Telegram injection and session-id sanitization.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| CI Workflow: .github/workflows/e2e-brev.yaml | Expanded the job if condition so e2e-brev runs for both NVIDIA/NemoClaw and jyaunches/NemoClaw. |
| Provisioning & Setup Scripts: scripts/brev-setup.sh, scripts/setup.sh | Add cloud-init wait pre-provisioning; remove stdout/stderr suppression for Node/Docker installer/apt calls; require nvidia-smi to run (not just exist); handle CPU-only hosts by removing default-runtime: "nvidia" from the Docker daemon config and restarting Docker. |
| E2E Test Harness / Sync: test/e2e/brev-e2e.test.js | Use shellQuote for secret exports; add best-effort brev delete teardown; wait for cloud-init completion (cloud-init status --wait) before running provisioning; increase the SSH timeout for that wait. |
| Injection & Sanitization E2E Tests: test/e2e/test-telegram-injection.sh | Add extensive tests that invoke the real runAgentInSandbox() via a temp Node harness, assert against $()/backtick and multi-line injections, verify absence of sandbox marker files and sensitive API keys in process args, expand sanitizeSessionId() edge-case checks, and add cleanup for new artifacts. |

Sequence Diagram(s)

sequenceDiagram
    participant Tester as Tester
    participant SSH as SSH
    participant VM as Remote VM
    participant CloudInit as cloud-init
    participant Setup as brev-setup.sh
    participant Docker as Docker

    Tester->>SSH: sync code & connect to VM
    SSH->>VM: deliver files
    Tester->>VM: wait for cloud-init (cloud-init status --wait)
    VM->>CloudInit: block until finished
    CloudInit-->>VM: ready
    Tester->>VM: trigger `brev-setup.sh`
    VM->>Setup: execute setup script
    Setup->>Docker: run `nvidia-smi` to detect GPU
    Docker-->>Setup: success (GPU) or fail (CPU-only)
    alt CPU-only
        Setup->>VM: edit /etc/docker/daemon.json to remove default-runtime
        Setup->>Docker: restart docker
    end
    Setup-->>VM: provisioning complete
    VM-->>Tester: setup finished

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I waited on clouds and checked each tool,

I sniffed for GPUs and fixed the pool.
I hopped through tests both long and spry,
Sanitized names and watched keys fly by,
A tidy hop — then off I spool.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped; CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'test: add real runAgentInSandbox() E2E tests for telegram injection (Phases 7-11)' directly and specifically describes the main change: adding new E2E test phases 7-11 for telegram injection testing via the real runAgentInSandbox() code path. |



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (2)
.github/workflows/e2e-brev.yaml (1)

92-93: Repository condition removed for fork testing.

Commenting out the if: github.repository == 'NVIDIA/NemoClaw' condition allows the workflow to run on forks. This is safe because:

  1. Forks need to configure BREV_API_TOKEN and NVIDIA_API_KEY secrets themselves
  2. workflow_call still requires secrets to be passed explicitly

Consider restoring the condition before merging to main if fork testing is complete, or document that this is intentional.
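If the intent is to keep fork testing enabled temporarily, one documented option is to broaden rather than delete the guard. Repository names are taken from this thread; treat the fragment as a sketch:

```yaml
jobs:
  e2e-brev:
    # TODO: drop the fork entry once fork testing is complete.
    # Requires BREV_API_TOKEN and NVIDIA_API_KEY secrets in the running repo.
    if: github.repository == 'NVIDIA/NemoClaw' || github.repository == 'jyaunches/NemoClaw'
```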

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/e2e-brev.yaml around lines 92 - 93, The conditional "if:
github.repository == 'NVIDIA/NemoClaw'" was removed to allow fork testing;
either restore that repository guard before merging to main or add an explicit
note in the workflow header explaining this is intentional and temporary; if
keeping it removed, add a clear TODO/COMMENT referencing the required secrets
(BREV_API_TOKEN, NVIDIA_API_KEY) and remind maintainers that workflow_call still
requires secrets to be passed explicitly so reviewers know the risk and intended
duration of this change.
test/e2e/test-telegram-injection.sh (1)

466-507: Test harness calls unexported function with incorrect signature — expected to fail on unfixed code.

The test script calls bridge.runAgentInSandbox(message, sessionId, { apiKey, sandbox, openshell }) with 3 parameters and assumes the function is exported. However, per scripts/telegram-bridge.js:99-159:

  1. runAgentInSandbox only accepts 2 parameters: (message, sessionId)
  2. The function is not exported — it's an internal helper

Per the PR objectives, this test is designed to fail on the current unfixed code and will pass after the fix in PR #119 is merged. Consider adding an inline comment in the test script explaining this expected failure state.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-telegram-injection.sh` around lines 466 - 507, Add an inline
comment at the top of the generated test script (the here-doc that creates
/tmp/test-real-bridge.js) explaining that this test intentionally calls
bridge.runAgentInSandbox with three arguments even though runAgentInSandbox is
currently an unexported internal helper that only accepts two parameters (per
scripts/telegram-bridge.js), so the test is expected to fail until PR `#119` is
merged; leave the test invocation unchanged so it serves as a known-failing
check and documents the expected state.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/brev-setup.sh`:
- Around line 54-57: The pgrep call inside the while loop uses a basic regex
with escaped alternation that won't match as intended; update the pgrep
invocation in the while loop (the pgrep -x "apt-get\|apt\|dpkg" usage) to either
use extended regex (pgrep -E -x 'apt-get|apt|dpkg') or perform separate checks
(e.g., pgrep -x apt-get || pgrep -x apt || pgrep -x dpkg) so the loop correctly
detects any apt/apt-get/dpkg processes and waits accordingly.

In `@test/e2e/brev-e2e.test.js`:
- Around line 63-68: The envPrefix construction in sshEnv embeds
process.env.NVIDIA_API_KEY and process.env.GITHUB_TOKEN directly into
single-quoted shell strings, which breaks and enables command injection if
secrets contain single quotes; restore or reimplement the escaping used
previously (the shellQuote function from bin/lib/runner.js or equivalent) and
apply it to both process.env.NVIDIA_API_KEY and process.env.GITHUB_TOKEN when
building envPrefix (and anywhere sshEnv or envPrefix is used) so single quotes
are replaced with the safe sequence (e.g., closing quote, escaped quote,
reopening quote) to prevent injection.
- Around line 168-175: The pgrep call inside the ssh invocation in the E2E test
uses basic regex so the alternation literal "|" won't work; update the command
string passed to ssh (the multi-line string that contains `pgrep -x
"apt-get|apt|dpkg"`) to use extended regex by changing the pgrep flag to `-Ex`
(i.e., `pgrep -Ex "apt-get|apt|dpkg"`), matching the fix used in `brev-setup.sh`
so the loop correctly detects any of the three processes.

In `@test/e2e/test-telegram-injection.sh`:
- Around line 586-587: Remove the premature cleanup that deletes
/tmp/test-real-bridge.js after T11: delete the rm -f /tmp/test-real-bridge.js
command found in the T11 section so T12 can access the script, relying instead
on the existing Phase 11 cleanup that runs rm -f /tmp/test-real-bridge.js
/tmp/test-session-id.js; ensure no other occurrences in the T10–T12 block
(around the current rm at lines ~586–589 and 605–609) remain that would remove
the file before T12 runs.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4c8d90a7-d4ba-472c-ac3b-7a2c6754a848

📥 Commits

Reviewing files that changed from the base of the PR and between c63d37b and 3f376eb.

📒 Files selected for processing (5)
  • .github/workflows/e2e-brev.yaml
  • scripts/brev-setup.sh
  • scripts/setup.sh
  • test/e2e/brev-e2e.test.js
  • test/e2e/test-telegram-injection.sh


@ericksoa ericksoa left a comment


Nice work on the real runAgentInSandbox() tests and the Brev infra fixes, @jyaunches. Two things to fix before merge:

1. Restore the workflow fork guard. .github/workflows/e2e-brev.yaml has the if: github.repository == 'NVIDIA/NemoClaw' guard commented out. This looks like a leftover from fork testing. If merged, the Brev E2E job will spin up on every fork PR (tests skip without BREV_API_TOKEN, but the runner still gets consumed). Please uncomment it.

2. Restore shellEscape() for env var interpolation. brev-e2e.test.js removes the shellEscape() helper and stops escaping NVIDIA_API_KEY/GITHUB_TOKEN before interpolating them into single-quoted shell strings. If a token ever contains a ', the remote shell command will break with an unterminated quote. The function was there for a reason; please restore it.
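The escaping being asked for can be sketched as below. The real helper is shellQuote in bin/lib/runner.js; this assumed version shows the standard close-quote/escape/reopen trick:

```javascript
// Assumed equivalent of the removed helper: wrap the value in single
// quotes and replace each embedded ' with '\'' (close quote, escaped
// quote, reopen quote) so the shell never interprets secret contents.
function shellQuote(value) {
  return "'" + String(value).replace(/'/g, "'\\''") + "'";
}

// A secret containing a single quote can no longer break out of the
// surrounding quoting or inject commands:
const token = "abc'; rm -rf /tmp/x; echo '";
console.log(`export NVIDIA_API_KEY=${shellQuote(token)}`);
```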

- Use pgrep -Ex for extended regex alternation (brev-setup.sh, brev-e2e.test.js)
- Restore shellQuote() for secret escaping in sshEnv() to prevent injection
  if secrets contain single quotes
- Remove premature cleanup of test-real-bridge.js before Phase 8 uses it

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/brev-setup.sh`:
- Around line 81-89: The retry loops around apt-get installs (the nodejs loop
using wait_for_apt and the Docker loop referenced at lines 99-106) currently
fall through on final failure; change them so that if all attempts fail they
exit the script with a non-zero status and log the failure. Concretely, after
the for-attempt loop for nodejs (the block containing wait_for_apt and sudo
apt-get install -y -qq nodejs) and similarly after the Docker install loop,
check whether the install succeeded and if not call error/info to report the
permanent failure and run exit 1 (or use set -e with an explicit conditional) so
bootstrapping stops instead of continuing with missing dependencies.
- Around line 52-57: The wait loop that polls pgrep -Ex "apt-get|apt|dpkg" can
hang indefinitely; modify the loop in scripts/brev-setup.sh to enforce a timeout
by introducing a max-wait (e.g., MAX_WAIT_SECONDS or MAX_ATTEMPTS) and an
elapsed/attempt counter, increment it each iteration (after sleep), and break
the loop if the timeout is reached—log a warning/error via the existing info
function (or exit non-zero if bootstrap must fail) so the script does not hang;
reference the existing while pgrep -Ex ... >/dev/null 2>&1 loop and the info
"Waiting for apt/dpkg processes to finish..." call when adding the timeout
logic.

In `@test/e2e/brev-e2e.test.js`:
- Around line 168-179: The loop that waits for apt in the ssh call can exit
after 120 tries with apt still running, so change the remote script executed by
ssh (the code around the ssh(...) call and the inline for loop) to fail the
readiness gate when apt/dpkg is still busy: after the for loop add a final check
(using the same fuser/pgrep conditions) and if it still finds locks or
processes, print an error and exit 1 so the ssh command returns non-zero;
otherwise continue normally. Ensure the failing branch returns non-zero to fail
the test/bootstrap rather than silently succeeding.

In `@test/e2e/test-telegram-injection.sh`:
- Around line 697-737: Add an end-to-end assertion that the caller uses the
sanitized session id by faking the ssh executable (or mocking
child_process.spawn) before invoking runAgentInSandbox from
scripts/telegram-bridge.js: ensure the test sets PATH to a temp dir containing a
wrapper "ssh" that captures argv and writes it out, invoke the real code path
that calls runAgentInSandbox with a leading-hyphen input, and assert the
captured argv value after the --session-id flag contains the sanitized token
(from sanitizeSessionId) not the original input; reference sanitizeSessionId and
runAgentInSandbox in scripts/telegram-bridge.js to locate where to inject the
fake ssh or spawn mock.
- Around line 567-579: Replace the single sleep + one-time ps snapshot with a
polling loop that repeatedly samples the process list until a short timeout to
catch short-lived ssh invocations: after setting API_KEY_PREFIX, run a loop
(e.g., total timeout ~5–10s with small sleeps like 0.2–0.5s) that reassigns
ps_output=$(ps aux ... | grep -v grep | grep -v "test-telegram-injection" ||
true) and checks if echo "$ps_output" | grep -qF "$API_KEY_PREFIX"; if found
call fail "T11: API key found..." and break immediately, otherwise continue
until timeout and then call pass "T11: API key NOT visible..." if never seen;
keep using the same API_KEY_PREFIX and ps_output symbols so the change is
localized in test-telegram-injection.sh.
- Around line 619-632: The test currently falls back to the ad-hoc SSH path (the
inline 'MSG=$(cat) && echo "$MSG"' command) and therefore doesn't exercise
scripts/telegram-bridge.js or runAgentInSandbox()'s multiline/backtick handling;
update the test so the SSH invocation actually runs the bridge/agent inside the
sandbox (e.g., call the same entrypoint used by runAgentInSandbox or directly
execute scripts/telegram-bridge.js) while still sending PAYLOAD_MULTILINE_BT via
stdin, so that the payload with backticks is processed by the bridge code
instead of being echoed by the ad-hoc SSH command; locate the invocation using
the PAYLOAD_MULTILINE_BT variable and the timeout ssh -F "$ssh_config_t13"
"openshell-${SANDBOX_NAME}" line and replace the inline 'MSG=$(cat)...' command
with the proper bridge/agent entrypoint.
- Around line 476-496: The harness requires the bridge before setting the
sandbox, so the module-level SANDBOX captured in scripts/telegram-bridge.js is
never overridden; move the require and module import after you compute and set
the test sandbox so the bridge sees the correct SANDBOX_NAME: compute sandbox
(const sandbox = ...), set process.env.SANDBOX_NAME = sandbox (and any other env
overrides like OPENSHELL_PATH if needed), then require the bridge and call
bridge.runAgentInSandbox(message, sessionId) (or update to pass options if you
prefer changing runAgentInSandbox to accept an options param).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 24da710a-c6cf-4aa7-aac9-2c644c0ae147

📥 Commits

Reviewing files that changed from the base of the PR and between 3f376eb and 0e11c49.

📒 Files selected for processing (3)
  • scripts/brev-setup.sh
  • test/e2e/brev-e2e.test.js
  • test/e2e/test-telegram-injection.sh

Comment on lines +168 to +179
```js
// Wait for Brev's provisioning to finish — it runs apt-get on boot
// and the locks can be released between operations, so we wait for
// both locks AND processes to be gone.
console.log(`[${elapsed()}] Waiting for Brev provisioning (apt) to finish...`);
ssh(
  `for i in $(seq 1 120); do ` +
    `if fuser /var/lib/dpkg/lock-frontend /var/lib/apt/lists/lock /var/cache/apt/archives/lock >/dev/null 2>&1 || ` +
    `pgrep -Ex "apt-get|apt|dpkg" >/dev/null 2>&1; then ` +
    `echo "apt still running... ($i/120)"; sleep 5; ` +
    `else break; fi; done`,
  { timeout: 660_000, stream: true },
);
```


⚠️ Potential issue | 🟠 Major

Fail the readiness gate when apt is still busy after the loop.

This loop always returns success after 120 iterations, even if apt/dpkg never settles. In that case the bootstrap starts anyway and the same lock race can still happen.

Proposed fix:

```diff
     console.log(`[${elapsed()}] Waiting for Brev provisioning (apt) to finish...`);
     ssh(
-      `for i in $(seq 1 120); do ` +
+      `ready=0; for i in $(seq 1 120); do ` +
         `if fuser /var/lib/dpkg/lock-frontend /var/lib/apt/lists/lock /var/cache/apt/archives/lock >/dev/null 2>&1 || ` +
         `pgrep -Ex "apt-get|apt|dpkg" >/dev/null 2>&1; then ` +
         `echo "apt still running... ($i/120)"; sleep 5; ` +
-        `else break; fi; done`,
+        `else ready=1; break; fi; done; ` +
+        `test "$ready" = 1 || { echo "apt still running after 600s"; exit 1; }`,
       { timeout: 660_000, stream: true },
     );
```

Comment on lines +476 to +496
```js
const bridge = require(path.join(process.cwd(), 'scripts/telegram-bridge.js'));

async function test() {
  const message = process.argv[2] || 'Hello';
  const sessionId = process.argv[3] || 'e2e-test';
  const apiKey = process.env.NVIDIA_API_KEY;
  const sandbox = process.env.NEMOCLAW_SANDBOX_NAME || 'e2e-test';
  const openshell = process.env.OPENSHELL_PATH || 'openshell';

  if (!apiKey) {
    console.error('NVIDIA_API_KEY required');
    process.exit(1);
  }

  try {
    // Call the real runAgentInSandbox with options override for testing
    const result = await bridge.runAgentInSandbox(message, sessionId, {
      apiKey,
      sandbox,
      openshell,
    });
```


⚠️ Potential issue | 🟠 Major

This harness doesn't currently override the bridge config.

In scripts/telegram-bridge.js, runAgentInSandbox is defined as runAgentInSandbox(message, sessionId) and captures SANDBOX from process.env.SANDBOX_NAME when the module loads. The third argument passed here is ignored, and this script never sets SANDBOX_NAME before require(...), so T9/T10/T12/T15 can hit the default sandbox or fail before the intended path is exercised.

Proposed fix:

```diff
-const bridge = require(path.join(process.cwd(), 'scripts/telegram-bridge.js'));
+if (process.env.NEMOCLAW_SANDBOX_NAME) {
+  process.env.SANDBOX_NAME = process.env.NEMOCLAW_SANDBOX_NAME;
+}
+const bridge = require(path.join(process.cwd(), 'scripts/telegram-bridge.js'));

 async function test() {
   const message = process.argv[2] || 'Hello';
   const sessionId = process.argv[3] || 'e2e-test';
-  const apiKey = process.env.NVIDIA_API_KEY;
-  const sandbox = process.env.NEMOCLAW_SANDBOX_NAME || 'e2e-test';
-  const openshell = process.env.OPENSHELL_PATH || 'openshell';

-  if (!apiKey) {
+  if (!process.env.NVIDIA_API_KEY) {
     console.error('NVIDIA_API_KEY required');
     process.exit(1);
   }

   try {
-    // Call the real runAgentInSandbox with options override for testing
-    const result = await bridge.runAgentInSandbox(message, sessionId, {
-      apiKey,
-      sandbox,
-      openshell,
-    });
+    const result = await bridge.runAgentInSandbox(message, sessionId);
```

Comment on lines +697 to +737
```sh
# Create a fresh test script that checks the actual session ID used
cat >/tmp/test-session-id.js <<'TESTSCRIPT'
const path = require('path');
const resolveOpenshellPath = require.resolve(path.join(process.cwd(), 'bin/lib/resolve-openshell'));
require.cache[resolveOpenshellPath] = {
  exports: { resolveOpenshell: () => process.env.OPENSHELL_PATH || 'openshell' }
};
const bridge = require(path.join(process.cwd(), 'scripts/telegram-bridge.js'));

// Just test the sanitization with a negative chat ID (like Telegram groups)
const sessionId = process.argv[2] || '--help';
const sanitized = bridge.sanitizeSessionId(sessionId);
console.log('INPUT:', sessionId);
console.log('SANITIZED:', sanitized);
if (sanitized && sanitized.startsWith('-')) {
  console.log('ERROR: Sanitized ID starts with hyphen — option injection possible!');
  process.exit(1);
} else {
  console.log('OK: No leading hyphen in sanitized ID');
  process.exit(0);
}
TESTSCRIPT

t15_result=$(cd "$REPO" && OPENSHELL_PATH="$OPENSHELL_PATH" \
  node /tmp/test-session-id.js '--help' 2>&1) || true

if echo "$t15_result" | grep -q "OK: No leading hyphen"; then
  pass "T15: Session ID '--help' sanitized to prevent option injection"
else
  fail "T15: Session ID sanitization failed: $t15_result"
fi

# Also test with negative number (Telegram group chat ID)
t15b_result=$(cd "$REPO" && OPENSHELL_PATH="$OPENSHELL_PATH" \
  node /tmp/test-session-id.js '-123456789' 2>&1) || true

if echo "$t15b_result" | grep -q "OK: No leading hyphen"; then
  pass "T15b: Negative chat ID '-123456789' sanitized correctly"
else
  fail "T15b: Negative chat ID sanitization failed: $t15b_result"
fi
```


⚠️ Potential issue | 🟠 Major

T15 doesn't validate the caller yet.

This temporary script only checks sanitizeSessionId() in isolation. It never proves that runAgentInSandbox() actually uses the sanitized value when building the SSH command, so an option-injection regression in the caller would still pass.

One way to cover the advertised “real call” behavior is to put a fake ssh earlier in PATH (or mock child_process.spawn) and assert that the argv after --session-id contains the sanitized token rather than the original leading-hyphen input.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/test-telegram-injection.sh` around lines 697 - 737, Add an
end-to-end assertion that the caller uses the sanitized session id by faking the
ssh executable (or mocking child_process.spawn) before invoking
runAgentInSandbox from scripts/telegram-bridge.js: ensure the test sets PATH to
a temp dir containing a wrapper "ssh" that captures argv and writes it out,
invoke the real code path that calls runAgentInSandbox with a leading-hyphen
input, and assert the captured argv value after the --session-id flag contains
the sanitized token (from sanitizeSessionId) not the original input; reference
sanitizeSessionId and runAgentInSandbox in scripts/telegram-bridge.js to locate
where to inject the fake ssh or spawn mock.
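The fake-`ssh` approach the prompt describes can be demonstrated standalone. The sketch below is illustrative only: it calls `ssh` directly as a stand-in for the real code path (an actual test would have `runAgentInSandbox()` issue the SSH command), and the `mockdir`/`SSH_ARGV_LOG` names are invented for the demo.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Create a temp dir holding a fake `ssh` that records its argv and exits.
mockdir=$(mktemp -d)
cat >"$mockdir/ssh" <<'EOF'
#!/usr/bin/env bash
printf '%s\n' "$@" > "${SSH_ARGV_LOG:?}"
EOF
chmod +x "$mockdir/ssh"

# Put the fake ssh first in PATH, then run the code under test. Here we
# invoke ssh directly as a stand-in; the real test would instead run
# node code that calls runAgentInSandbox() with a leading-hyphen input.
export SSH_ARGV_LOG="$mockdir/argv.log"
PATH="$mockdir:$PATH" ssh host openshell --session-id 'chat-123' -- run

# Assert the argv after --session-id is the sanitized token.
captured=$(grep -A1 -- '--session-id' "$SSH_ARGV_LOG" | tail -n1)
if [ "$captured" = "chat-123" ]; then
  echo "OK: --session-id received sanitized token"
else
  echo "FAIL: got '$captured'"
  exit 1
fi
```

Because each argv element lands on its own line of the log, the assertion inspects the exact token passed after `--session-id` rather than a substring of the whole command line.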

Uncomment the github.repository check so Brev E2E only runs on
NVIDIA/NemoClaw, not on every fork PR.
@jyaunches jyaunches force-pushed the test/telegram-injection-only branch 2 times, most recently from bacdf3a to 409caca Compare April 1, 2026 13:17
@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/brev-setup.sh (1)

120-153: ⚠️ Potential issue | 🟠 Major

Extract and reuse GPU detection logic consistently.

The toolkit section (line 88) correctly validates GPU presence by running nvidia-smi, but the vLLM gate (line 168) only checks if the nvidia-smi command exists. On Brev CPU-only images that ship nvidia-smi without a functional GPU, vLLM will still be installed. Create a shared GPU detection function that validates actual GPU availability (not just binary presence) and use it in both locations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/brev-setup.sh` around lines 120 - 153, Create a reusable GPU
detection helper (e.g., detect_gpu or has_gpu) that returns true only when the
nvidia-smi binary exists and actually reports a GPU (run nvidia-smi and check
its exit/status), then replace the two ad-hoc checks: the current block that
begins with "if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null
2>&1; then" (the NVIDIA Container Toolkit install gate) and the vLLM
installation gate that only checks for the presence of the nvidia-smi command,
to call this new detect_gpu function instead so both places require a functional
GPU before proceeding. Ensure the new helper is defined near the top of the
script and used in both places (referencing detect_gpu or has_gpu in the
conditionals).
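A minimal version of the shared helper this prompt asks for might look as follows (`detect_gpu` is the reviewer's suggested name, not code from the repo). The demo exercises it against stubbed `nvidia-smi` binaries so the two branches can be checked on any machine:

```shell
#!/usr/bin/env bash
set -u

# True only when nvidia-smi exists AND actually reports a working GPU.
detect_gpu() {
  command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1
}

# Exercise the helper with stubbed nvidia-smi binaries, mimicking Brev
# CPU-only images that ship the binary without a functional GPU.
stubdir=$(mktemp -d)

printf '#!/bin/sh\nexit 0\n' > "$stubdir/nvidia-smi"   # GPU present
chmod +x "$stubdir/nvidia-smi"
( PATH="$stubdir"; detect_gpu ) && echo "stub-ok: GPU detected"

printf '#!/bin/sh\nexit 1\n' > "$stubdir/nvidia-smi"   # binary exists, no GPU
( PATH="$stubdir"; detect_gpu ) || echo "stub-fail: no functional GPU"
```

Both gates in the script would then reduce to `if detect_gpu; then ...`, so they cannot drift apart again.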
♻️ Duplicate comments (1)
scripts/brev-setup.sh (1)

54-57: ⚠️ Potential issue | 🟠 Major

Add the same timeout to the process-only loop.

If apt/dpkg gets stuck after releasing its lock files, this loop never exits and the VM bootstrap hangs forever. Reuse max_wait/waited here too.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/brev-setup.sh` around lines 54 - 57, The apt/dpkg wait loop (while
pgrep -Ex "apt-get|apt|dpkg" >/dev/null 2>&1; do ... done) can hang indefinitely
if processes leave locks but persist; update this loop to reuse the existing
timeout logic by referencing the max_wait and waited variables: increment waited
inside the loop, check if waited exceeds max_wait, and break/fail (e.g., return
non-zero or log and exit) to avoid infinite hang. Ensure the same info "Waiting
for apt/dpkg processes to finish..." message remains and that waited is
incremented and compared to max_wait in this loop exactly as done in the other
lock-waiting loop.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a4734c54-6789-4e46-91f5-47b407ae5f5d

📥 Commits

Reviewing files that changed from the base of the PR and between 972cc51 and bacdf3a.

📒 Files selected for processing (2)
  • .github/workflows/e2e-brev.yaml
  • scripts/brev-setup.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/e2e-brev.yaml

Comment on lines +40 to +58
# Wait for any existing apt locks (e.g. Brev's own provisioning on boot)
wait_for_apt() {
  local max_wait=300 # 5 minutes
  local waited=0
  while fuser /var/lib/dpkg/lock-frontend /var/lib/apt/lists/lock /var/cache/apt/archives/lock >/dev/null 2>&1; do
    if [ $waited -ge $max_wait ]; then
      fail "apt lock still held after ${max_wait}s — another process is stuck"
    fi
    info "Waiting for apt lock to be released... (${waited}s)"
    sleep 10
    waited=$((waited + 10))
  done
  # Wait for any apt processes to fully exit — locks can be released
  # momentarily between apt operations in a multi-step provisioning sequence
  while pgrep -Ex "apt-get|apt|dpkg" >/dev/null 2>&1; do
    info "Waiting for apt/dpkg processes to finish..."
    sleep 5
  done
}
⚠️ Potential issue | 🟠 Major

Require a quiet window before returning.

Lines 52-53 say this is protecting multi-step provisioning, but the helper still returns on the first moment where both checks are clear. A brief gap between two apt steps will satisfy both loops and let our install start right before the next lock acquisition. Keep polling until locks and apt/dpkg processes have both stayed absent for a short window.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/brev-setup.sh` around lines 40 - 58, The wait_for_apt helper
(function wait_for_apt) currently returns as soon as both lock files (checked
via fuser) and apt/dpkg processes (checked via pgrep) are momentarily absent;
change it to require a short “quiet window” before returning: after seeing no
locks and no apt/dpkg processes, start a small timer/window (e.g., 5–10s) and
continue polling both checks during that window, resetting the timer if any lock
or process reappears; keep using the existing retry/backoff semantics (max_wait,
waited) and the same fuser/pgrep checks so the function only returns once both
checks have been continuously clear for the chosen quiet window.

@jyaunches jyaunches force-pushed the test/telegram-injection-only branch 2 times, most recently from e270d25 to ff85730 Compare April 1, 2026 13:24
@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 2

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4b49e912-4770-41bd-af5f-e22b5e971985

📥 Commits

Reviewing files that changed from the base of the PR and between bacdf3a and ff85730.

📒 Files selected for processing (4)
  • .github/workflows/e2e-brev.yaml
  • scripts/brev-setup.sh
  • test/e2e/brev-e2e.test.js
  • test/e2e/test-telegram-injection.sh
🚧 Files skipped from review as they are similar to previous changes (3)
  • .github/workflows/e2e-brev.yaml
  • test/e2e/brev-e2e.test.js
  • test/e2e/test-telegram-injection.sh

Comment on lines +40 to +45
# Wait for cloud-init to finish — Brev instances run apt provisioning on boot.
# This is the standard way to wait for all boot-time setup to complete.
if command -v cloud-init >/dev/null 2>&1; then
  info "Waiting for cloud-init to finish..."
  cloud-init status --wait >/dev/null 2>&1 || true
  info "cloud-init done"
⚠️ Potential issue | 🟠 Major

Fail fast if cloud-init does not complete successfully.

At Line 44, || true suppresses a real bootstrapping failure and can let later apt/setup steps run in an inconsistent state.

Proposed fix
 if command -v cloud-init >/dev/null 2>&1; then
   info "Waiting for cloud-init to finish..."
-  cloud-init status --wait >/dev/null 2>&1 || true
+  cloud-init status --wait >/dev/null 2>&1 \
+    || fail "cloud-init failed; aborting bootstrap"
   info "cloud-init done"
 fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/brev-setup.sh` around lines 40 - 45, The script currently swallows
cloud-init failures with "cloud-init status --wait >/dev/null 2>&1 || true";
remove the "|| true" and instead check the exit status of "cloud-init status
--wait", logging a clear error via the info/error logging helper used in the
script (the existing info calls around cloud-init) and exit with a non-zero
status (e.g., exit 1) if cloud-init fails so the setup fails fast and doesn't
proceed in an inconsistent state.

Comment on lines +86 to +88
# Check that nvidia-smi actually works, not just that the binary exists.
# Brev GPU images ship nvidia-smi even on CPU-only instances.
if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
⚠️ Potential issue | 🟠 Major

Use one GPU capability flag across all GPU-gated sections.

This block correctly checks runnable nvidia-smi, but later vLLM gating still checks only binary presence (Line 168). On CPU-only images that still ship nvidia-smi, that can trigger wrong behavior.

Proposed fix
+# Detect GPU availability once and reuse everywhere.
+HAS_GPU=false
+if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
+  HAS_GPU=true
+fi
+
 # --- 2. NVIDIA Container Toolkit (if GPU present) ---
-if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
+if [ "$HAS_GPU" = true ]; then
   ...
 fi

 # --- 4. vLLM (local inference, if GPU present) ---
 ...
-elif command -v nvidia-smi >/dev/null 2>&1; then
+elif [ "$HAS_GPU" = true ]; then
   ...
 fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/brev-setup.sh` around lines 86 - 88, The script currently tests a
runnable nvidia-smi in one place but later re-checks only binary presence when
gating vLLM; define and export a single GPU capability flag (e.g.,
GPU_AVAILABLE) right after the runnable check (the block using command -v
nvidia-smi and nvidia-smi) and then replace other ad-hoc checks (the later vLLM
gating that only calls command -v nvidia-smi) to consult that single flag;
ensure the flag is set to true only when nvidia-smi successfully runs and false
otherwise, and use that variable everywhere the script needs to decide GPU vs
CPU behavior (including vLLM gating).

@jyaunches jyaunches force-pushed the test/telegram-injection-only branch from ff85730 to 155b82b Compare April 1, 2026 13:39
@jyaunches jyaunches force-pushed the test/telegram-injection-only branch from 155b82b to ab80478 Compare April 1, 2026 14:24
@jyaunches
Copy link
Copy Markdown
Contributor Author

Splitting into two focused PRs to separate concerns:

  • Infra: Brev E2E hardening for CPU-only instances (setup scripts, test harness)
  • Tests: Telegram injection E2E tests (test-telegram-injection.sh only)

This avoids review ping-pong on shared infrastructure changes in a test-focused PR.
