
tests: improve robustness against log pipeline delays and CI load #1144

Merged
uncleDecart merged 4 commits into lf-edge:master from eriknordmark:erik-lps
Apr 23, 2026


Contributor

@eriknordmark eriknordmark commented Apr 20, 2026

Problem

Two intermittent test failures observed in CI (ext4+TPM matrix):

  1. TestLog (smoke suite) — After EVE reboots, the log delivery pipeline
    can be completely silent for 15+ minutes while startup logs drain. Waiting
    longer or generating more SSH connections doesn't help because the disconnect
    entries are queued behind the backlog and never reach the controller within
    the test window.

  2. switch_net_vlans — Under CI load, VLAN networking for app1 takes
    longer to become reachable than the 1-minute retry window in
    wait_and_get_recv_data.sh allowed.

Fix

tests/lim/testdata/log_test.txt

Switch to a two-phase approach:

  • Phase 1 (25m timewait): wait for ANY log entry arriving at the controller,
    proving the pipeline is active and delivering fresh entries.
  • Phase 2 (7m timewait): only then start generating SSH connections and
    look for the "Disconnected" message — guaranteeing it won't be lost in a
    backlog.
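
The two-phase gating can be sketched in POSIX shell as follows. This is a minimal illustration and not the contents of log_test.txt: `wait_for`, `two_phase_wait`, and the two probe commands are hypothetical names, and the 10-second poll interval is an assumption. Only the 25-minute and 7-minute limits come from the description above.

```sh
# wait_for TIMEOUT_S INTERVAL_S CMD [ARGS...]
# Hypothetical helper: poll CMD until it succeeds or TIMEOUT_S seconds pass.
wait_for() {
    timeout_s=$1; interval_s=$2; shift 2
    deadline=$(( $(date +%s) + timeout_s ))
    while :; do
        "$@" && return 0
        [ "$(date +%s)" -ge "$deadline" ] && return 1
        sleep "$interval_s"
    done
}

# two_phase_wait PIPELINE_PROBE TARGET_PROBE [PHASE1_S] [PHASE2_S] [INTERVAL_S]
# Phase 2 is attempted only after phase 1 has succeeded.
# Both probes are placeholder command names supplied by the caller.
two_phase_wait() {
    pipeline_probe=$1; target_probe=$2
    phase1_s=${3:-1500}   # 25 minutes
    phase2_s=${4:-420}    # 7 minutes
    interval=${5:-10}     # assumed poll interval
    wait_for "$phase1_s" "$interval" "$pipeline_probe" || {
        echo "phase 1: pipeline still silent" >&2; return 2; }
    wait_for "$phase2_s" "$interval" "$target_probe" || {
        echo "phase 2: target entry not seen" >&2; return 3; }
}
```

A caller would pass one probe that succeeds once any log entry has reached the controller and a second that succeeds once the "Disconnected" entry is found. The distinct exit codes separate a silent pipeline from a missing entry.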

tests/network/testdata/switch_net_vlans.txt

Replace the fixed-count loop (12 × 5 s = 60 s) in wait_and_get_recv_data.sh
with a time-bounded loop (~4.5 min), and raise the enclosing exec timeout
from 5 m to 7 m.
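
As a rough sketch of the time-bounded pattern (not the actual wait_and_get_recv_data.sh), the loop below stops on elapsed wall-clock time rather than on an attempt count, so slow attempts cannot stretch the total wait. The function name `fetch_with_deadline` and the 270-second budget in the usage note are assumptions; 270 s is one reading of "~4.5 min".

```sh
# fetch_with_deadline BUDGET_S PAUSE_S CMD [ARGS...]
# Retry CMD until it succeeds with non-empty output, then print that output.
# Gives up once another pause would exceed BUDGET_S seconds.
fetch_with_deadline() {
    budget_s=$1; pause_s=$2; shift 2
    start=$(date +%s)
    while :; do
        out=$("$@" 2>/dev/null) && [ -n "$out" ] && {
            printf '%s\n' "$out"; return 0; }
        elapsed=$(( $(date +%s) - start ))
        # Stop when the next sleep would overrun the budget.
        [ $(( elapsed + pause_s )) -gt "$budget_s" ] && return 1
        sleep "$pause_s"
    done
}
```

Called as `fetch_with_deadline 270 5 some_command`, the loop finishes well inside a 7-minute (420 s) outer timeout, leaving margin for the final attempt to complete.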

@eriknordmark eriknordmark changed the title from "tests/eclient: fix nginx test IP discovery and add retry logic" to "tests: fix nginx IP discovery and log_test delivery race" Apr 21, 2026
@eriknordmark eriknordmark changed the title from "tests: fix nginx IP discovery and log_test delivery race" to "tests: improve robustness against log pipeline delays and CI load" Apr 21, 2026
eriknordmark and others added 4 commits April 22, 2026 18:36
nginx was deployed without port forwarding, so `eden pod ps` always
returned "-" in the INTERNAL column, making server_ip.sh set
ESERVER_IP="-" and causing `curl -` to fail. Fix by adding -p 8080:80
to the nginx deployment so its internal IP appears in pod ps.

Also replace the single-shot read of the stale pod_ps file in server_ip.sh
with a live retry loop (up to 30 × 10 s), matching the race-condition
fix applied to networking_light.txt in f1902ce.
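
A retry loop of this kind might look like the sketch below. It is not the code in server_ip.sh: `resolve_ip` is a hypothetical name, and the lookup command passed to it stands in for whatever extracts the INTERNAL column from `eden pod ps`. The loop treats both an empty result and the "-" placeholder as "not ready yet".

```sh
# resolve_ip ATTEMPTS PAUSE_S CMD [ARGS...]
# Run CMD until it prints something other than "" or "-", then echo it.
resolve_ip() {
    attempts=$1; pause_s=$2; shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        ip=$("$@" 2>/dev/null) || ip=""
        case "$ip" in
            ""|"-") ;;                          # no address yet, keep polling
            *) printf '%s\n' "$ip"; return 0 ;;
        esac
        # Skip the pause after the final attempt.
        [ "$i" -lt "$attempts" ] && sleep "$pause_s"
        i=$((i + 1))
    done
    return 1
}
```

With the limits from the commit message, a caller could write `ESERVER_IP=$(resolve_ip 30 10 lookup_internal_ip) || exit 1`, where `lookup_internal_ip` is a placeholder for the real lookup.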

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

After an EVE restart the first log-upload batch can be delayed 10+
minutes while startup logs queue up. The previous ssh.sh finished
in ~2.5 minutes, leaving no new "Disconnected" entries being generated
when the delivery pipeline eventually recovered. The 10m timewait
expired 29 seconds before the backlog cleared in CI run 24686747135.

Three changes:
- timewait 10m → 15m: enough buffer for the delayed first-batch
  upload to deliver the "Disconnected" messages.
- ssh.sh: replace the 15-iteration finite loop with a 14-minute
  time-bounded loop so connections are generated throughout the
  test window. If the pipeline recovers at any point, the most
  recent connection's log arrives within seconds.
- exec -t 5m → 16m: raise the background exec timeout to match
  the new loop duration. Add trap 'exit 0' TERM INT so the
  process exits cleanly when testscript sends SIGINT at script end.
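
The loop-plus-trap combination can be sketched as below. This is an illustration under assumptions and not the real ssh.sh: `run_for` is a hypothetical name and the command it repeats is supplied by the caller. The function is meant to be the body of a background script, because the trap and the `exit` apply to whichever shell calls it.

```sh
# run_for DURATION_S PAUSE_S CMD [ARGS...]
# Repeat CMD until DURATION_S seconds have elapsed, pausing between runs.
# Exits with status 0 on TERM or INT so the caller sees a clean shutdown.
run_for() {
    trap 'exit 0' TERM INT
    duration_s=$1; pause_s=$2; shift 2
    end=$(( $(date +%s) + duration_s ))
    while [ "$(date +%s)" -lt "$end" ]; do
        "$@" || true        # one failed attempt must not end the loop
        # Sleep in the background and wait on it: the wait builtin is
        # interrupted by a trapped signal, so the trap runs immediately
        # instead of after the sleep finishes.
        sleep "$pause_s" &
        wait $!
    done
}
```

For the 14-minute window described above, the script body would be roughly `run_for 840 PAUSE connection_command`, with the pause and the command chosen by the test.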

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

After an EVE reboot the log delivery pipeline can be completely silent
for 15+ minutes while startup logs drain. The previous fix (15m timewait
+ continuous SSH connections) still times out because the SSH disconnect
entries are buried behind the startup backlog and never surface within
the window.

Switch to a two-phase approach:
- Phase 1 (25m timewait): wait for ANY log entry to prove the pipeline
  is delivering fresh entries to the controller.
- Phase 2 (7m timewait): only then start generating SSH connections and
  look for the "Disconnected" message.

This ensures the Disconnected search never starts on a broken pipeline.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Under CI load, app1's VLAN networking can take longer than 1 minute to
become reachable. The original wait_and_get_recv_data.sh used a fixed
count loop (12 × 5s = 60s) which was too short.

Replace with a time-bounded loop (~4.5 min) and raise the enclosing
exec timeout from 5m to 7m to match.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Member

@uncleDecart uncleDecart left a comment


LGTM

Contributor

@rene rene left a comment


LGTM

@uncleDecart uncleDecart merged commit 5c6ebf7 into lf-edge:master Apr 23, 2026
17 of 20 checks passed

3 participants