tests: improve robustness against log pipeline delays and CI load by eriknordmark · Pull Request #1144 · lf-edge/eden

eriknordmark · 2026-04-20T19:44:22Z

Problem

Two intermittent test failures observed in CI (ext4+TPM matrix):

- TestLog (smoke suite): After EVE reboots, the log delivery pipeline
  can be completely silent for 15+ minutes while startup logs drain. Waiting
  longer or generating more SSH connections doesn't help because the disconnect
  entries are queued behind the backlog and never reach the controller within
  the test window.
- switch_net_vlans: Under CI load, VLAN networking for app1 takes
  longer to become reachable than the 1-minute retry window in
  wait_and_get_recv_data.sh allowed.

Fix

`tests/lim/testdata/log_test.txt`

Switch to a two-phase approach:

- Phase 1 (25m timewait): wait for ANY log entry arriving at the controller,
  proving the pipeline is active and delivering fresh entries.
- Phase 2 (7m timewait): only then start generating SSH connections and
  look for the "Disconnected" message, guaranteeing it won't be lost in a
  backlog.
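
The gating between the two phases can be sketched in shell as follows. This is a minimal illustration under stated assumptions, not the testscript from this PR: `poll`, `two_phase`, and the probe names in the usage note are hypothetical, and the trigger here runs once synchronously instead of generating connections in the background.

```shell
#!/bin/sh
# poll TIMEOUT_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND once per second until it succeeds
# or TIMEOUT_SECONDS of wall-clock time have elapsed.
poll() {
    timeout=$1
    shift
    end=$(( $(date +%s) + timeout ))
    until "$@"; do
        [ "$(date +%s)" -lt "$end" ] || return 1
        sleep 1
    done
}

# two_phase PHASE1_TIMEOUT PHASE2_TIMEOUT PIPELINE_PROBE TRIGGER TARGET_PROBE
# Phase 1 gates phase 2: TRIGGER never runs while the pipeline is silent.
two_phase() {
    poll "$1" "$3" || { echo "phase 1: pipeline never became active" >&2; return 1; }
    "$4" || return 2
    poll "$2" "$5" || { echo "phase 2: target entry not seen" >&2; return 3; }
}
```

Under these assumptions, a call such as `two_phase 1500 420 any_log_entry_seen generate_ssh_connections disconnected_seen` would mirror the 25 m and 7 m windows; the three command names are placeholders for the real checks.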

`tests/network/testdata/switch_net_vlans.txt`

Replace the fixed-count loop (12 × 5 s = 60 s) in wait_and_get_recv_data.sh
with a time-bounded loop (~4.5 min), and raise the enclosing exec timeout
from 5 m to 7 m.
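
A time-bounded loop of this kind might look like the sketch below. It is an illustration only, not the contents of wait_and_get_recv_data.sh: `retry_until` and `probe` are hypothetical names. The point is that the budget is stated in wall-clock seconds rather than as an attempt count.

```shell
#!/bin/sh
# A fixed-count loop has this general shape (12 attempts, 5 s apart;
# "probe" is a hypothetical reachability check):
#   for i in 1 2 3 4 5 6 7 8 9 10 11 12; do probe && break; sleep 5; done

# retry_until MAX_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or MAX_SECONDS of wall-clock time elapse.
retry_until() {
    max_seconds=$1
    interval=$2
    shift 2
    deadline=$(( $(date +%s) + max_seconds ))
    while :; do
        if "$@"; then
            return 0
        fi
        # Give up once the deadline has passed; otherwise wait and retry.
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1
        fi
        sleep "$interval"
    done
}
```

With this helper, `retry_until 270 5 probe` would give roughly the 4.5-minute window described above, and the enclosing exec timeout only needs to exceed that figure.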

nginx was deployed without port forwarding, so `eden pod ps` always returned "-" in the INTERNAL column, making server_ip.sh set ESERVER_IP="-" and causing `curl -` to fail.

Fix by adding -p 8080:80 to the nginx deployment so its internal IP appears in pod ps. Also replace the single-shot stale pod_ps file read in server_ip.sh with a live retry loop (up to 30 × 10 s), matching the same race condition fix applied to networking_light.txt in f1902ce.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

After an EVE restart the first log-upload batch can be delayed 10+ minutes while startup logs queue up. The previous ssh.sh finished in ~2.5 minutes, leaving no new "Disconnected" entries being generated when the delivery pipeline eventually recovered. The 10m timewait expired 29 seconds before the backlog cleared in CI run 24686747135.

Three changes:
- timewait 10m → 15m: enough buffer for the delayed first-batch upload to deliver the "Disconnected" messages.
- ssh.sh: replace the 15-iteration finite loop with a 14-minute time-bounded loop so connections are generated throughout the test window. If the pipeline recovers at any point, the most recent connection's log arrives within seconds.
- exec -t 5m → 16m: raise the background exec timeout to match the new loop duration. Add trap 'exit 0' TERM INT so the process exits cleanly when testscript sends SIGINT at script end.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
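
The clean-exit pattern from the last change can be sketched as below. This is an assumption-laden illustration, not the ssh.sh from this commit: `generate_until` and its arguments are hypothetical. Note that a POSIX shell runs a trap only after the current foreground command finishes, so the loop reacts to a signal within one sleep interval rather than instantly.

```shell
#!/bin/sh
# generate_until DURATION_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND repeatedly for DURATION_SECONDS of
# wall-clock time. Meant to be the main body of a background script,
# because the trap it installs stays in effect for the rest of the shell.
generate_until() {
    # Turn a stop request into a clean exit (status 0) instead of a
    # death-by-signal status that a test harness could report as a failure.
    trap 'exit 0' TERM INT
    duration=$1
    interval=$2
    shift 2
    end=$(( $(date +%s) + duration ))
    while [ "$(date +%s)" -lt "$end" ]; do
        "$@" || true    # a single failed attempt does not end the loop
        sleep "$interval"
    done
}
```

In a script the call would typically be the last line, for example `generate_until 840 10 some_connect_command`, where 840 s is the 14-minute window from the commit message and both the 10-second interval and `some_connect_command` are placeholders.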

After an EVE reboot the log delivery pipeline can be completely silent for 15+ minutes while startup logs drain. The previous fix (15m timewait + continuous SSH connections) still times out because the SSH disconnect entries are buried behind the startup backlog and never surface within the window.

Switch to a two-phase approach:
- Phase 1 (25m timewait): wait for ANY log entry to prove the pipeline is delivering fresh entries to the controller.
- Phase 2 (7m timewait): only then start generating SSH connections and look for the "Disconnected" message.

This ensures the Disconnected search never starts on a broken pipeline.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Under CI load, app1's VLAN networking can take longer than 1 minute to become reachable. The original wait_and_get_recv_data.sh used a fixed-count loop (12 × 5 s = 60 s), which was too short. Replace it with a time-bounded loop (~4.5 min) and raise the enclosing exec timeout from 5m to 7m to match.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

uncleDecart

LGTM

rene

LGTM

eriknordmark requested a review from uncleDecart as a code owner April 20, 2026 19:44

eriknordmark changed the title ~~tests/eclient: fix nginx test IP discovery and add retry logic~~ tests: fix nginx IP discovery and log_test delivery race Apr 21, 2026

eriknordmark changed the title ~~tests: fix nginx IP discovery and log_test delivery race~~ tests: improve robustness against log pipeline delays and CI load Apr 21, 2026

eriknordmark requested review from europaul, milan-zededa and rene April 21, 2026 21:13

eriknordmark and others added 4 commits April 22, 2026 18:36

eriknordmark force-pushed the erik-lps branch from 8808065 to deb2e06 Compare April 22, 2026 16:36

uncleDecart approved these changes Apr 23, 2026

rene approved these changes Apr 23, 2026

uncleDecart merged commit 5c6ebf7 into lf-edge:master Apr 23, 2026
17 of 20 checks passed

This was referenced May 7, 2026

[16.0-stable] backport test robustness fixes from master #1157

Merged

[14.5-stable] backport test robustness fixes from master #1159

Merged

[13.4-stable] backport test robustness fixes from master #1160

Merged
