tests: improve robustness against log pipeline delays and CI load #1144
Merged
uncleDecart merged 4 commits into lf-edge:master on Apr 23, 2026
nginx was deployed without port forwarding, so `eden pod ps` always returned "-" in the INTERNAL column. As a result, server_ip.sh set ESERVER_IP="-" and `curl -` failed.

Fix by adding -p 8080:80 to the nginx deployment so its internal IP appears in pod ps. Also replace the single-shot read of the stale pod_ps file in server_ip.sh with a live retry loop (up to 30 × 10s), matching the fix for the same race condition applied to networking_light.txt in f1902ce.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
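The actual server_ip.sh is not shown here, so the following is only a minimal sketch of a count-bounded retry loop of the kind the commit message describes. The function name `retry_for_ip` and the probe command passed to it are hypothetical. How the INTERNAL column is extracted from `eden pod ps` output is deliberately left to the caller.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): run a probe command up to $1 times,
# sleeping $2 seconds after each failed attempt, until it prints a value
# that is neither empty nor the "-" placeholder.
retry_for_ip() {
    attempts=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Query live state on every iteration instead of reading a stale file.
        ip=$("$@" 2>/dev/null) || ip=""
        if [ -n "$ip" ] && [ "$ip" != "-" ]; then
            printf '%s\n' "$ip"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example call with the limits quoted in the commit message (30 tries, 10 s apart);
# probe_internal_ip is an assumed caller-supplied function:
#   ESERVER_IP=$(retry_for_ip 30 10 probe_internal_ip) || exit 1
```

Rejecting "-" explicitly matters here: the original failure was not an empty value but a placeholder that looked like valid output to the rest of the script.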
After an EVE restart the first log-upload batch can be delayed 10+ minutes while startup logs queue up. The previous ssh.sh finished in ~2.5 minutes, so no new "Disconnected" entries were being generated by the time the delivery pipeline eventually recovered. In CI run 24686747135 the 10m timewait expired 29 seconds before the backlog cleared.

Three changes:
- timewait 10m → 15m: enough buffer for the delayed first-batch upload to deliver the "Disconnected" messages.
- ssh.sh: replace the 15-iteration finite loop with a 14-minute time-bounded loop so connections are generated throughout the test window. If the pipeline recovers at any point, the most recent connection's log arrives within seconds.
- exec -t 5m → 16m: raise the background exec timeout to cover the new loop duration. Add trap 'exit 0' TERM INT so the process exits cleanly when testscript sends SIGINT at script end.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
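A time-bounded generator loop with a clean-exit trap, as described for ssh.sh, might look roughly like the sketch below. The real script is not reproduced here. `run_for`, the pause between iterations, and the `connect_once` command in the usage comment are assumptions made for illustration.

```shell
#!/bin/sh
# Exit with status 0 when the harness interrupts or terminates this script,
# so an expected shutdown at the end of the test is not reported as a failure.
trap 'exit 0' TERM INT

# Hypothetical helper (not from the PR): call "$@" repeatedly until $1 seconds
# of wall-clock time have elapsed, pausing $2 seconds between calls.
run_for() {
    duration=$1
    pause=$2
    shift 2
    deadline=$(( $(date +%s) + duration ))
    while [ "$(date +%s)" -lt "$deadline" ]; do
        # A failed attempt is ignored: the goal is a steady stream of events,
        # not the success of any single one.
        "$@" || true
        sleep "$pause"
    done
}

# Example: keep generating connections for 14 minutes (840 s).
# connect_once is an assumed caller-supplied function wrapping the SSH call:
#   run_for 840 10 connect_once
```

Unlike a fixed iteration count, the deadline keeps events flowing for the whole window regardless of how long each attempt takes, which is what lets a late-recovering pipeline still pick up a recent entry.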
After an EVE reboot the log delivery pipeline can be completely silent for 15+ minutes while startup logs drain. The previous fix (15m timewait + continuous SSH connections) still times out because the SSH disconnect entries are buried behind the startup backlog and never surface within the window.

Switch to a two-phase approach:
- Phase 1 (25m timewait): wait for ANY log entry to prove the pipeline is delivering fresh entries to the controller.
- Phase 2 (7m timewait): only then start generating SSH connections and look for the "Disconnected" message.

This ensures the Disconnected search never starts on a broken pipeline.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
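The real change lives in a testscript file (log_test.txt) that is not shown, so the sketch below only illustrates the ordering of the two phases in plain shell. `run_two_phase` and its three command arguments are hypothetical. Each argument is assumed to be a command that enforces its own timeout and returns non-zero when it expires.

```shell
#!/bin/sh
# Hypothetical sketch of two-phase gating (names are not from the PR).
#   $1: blocks until ANY fresh log entry arrives; non-zero on timeout
#   $2: generates traffic (for example SSH connections); run in the background
#   $3: blocks until the target message arrives; non-zero on timeout
run_two_phase() {
    pipeline_ready=$1
    generate_traffic=$2
    find_message=$3

    # Phase 1: prove the pipeline is delivering before doing anything else.
    if ! "$pipeline_ready"; then
        echo "phase 1: log pipeline never delivered an entry" >&2
        return 1
    fi

    # Phase 2: only now start generating events and search for the message.
    "$generate_traffic" &
    gen_pid=$!
    rc=0
    "$find_message" || rc=$?

    # Stop the generator regardless of how the search ended.
    kill "$gen_pid" 2>/dev/null || true
    wait "$gen_pid" 2>/dev/null || true
    return "$rc"
}
```

One effect of this ordering is that a phase 1 failure points at the delivery pipeline, while a phase 2 failure points at the message itself, so the two causes are no longer conflated in one timeout.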
Under CI load, app1's VLAN networking can take longer than 1 minute to become reachable. The original wait_and_get_recv_data.sh used a fixed-count loop (12 × 5s = 60s), which was too short. Replace it with a time-bounded loop (~4.5 min) and raise the enclosing exec timeout from 5m to 7m to match.

Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This was referenced May 7, 2026
Problem
Two intermittent test failures observed in CI (ext4+TPM matrix):
- `TestLog` (smoke suite): After EVE reboots, the log delivery pipeline can be completely silent for 15+ minutes while startup logs drain. Waiting longer or generating more SSH connections doesn't help because the disconnect entries are queued behind the backlog and never reach the controller within the test window.
- `switch_net_vlans`: Under CI load, VLAN networking for app1 takes longer to become reachable than the 1-minute retry window in `wait_and_get_recv_data.sh` allowed.

Fix
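tests/lim/testdata/log_test.txt
Switch to a two-phase approach:
- Phase 1 (25m timewait): wait for any log entry, proving the pipeline is active and delivering fresh entries.
- Phase 2 (7m timewait): only then start generating SSH connections and look for the "Disconnected" message, guaranteeing it won't be lost in a backlog.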
tests/network/testdata/switch_net_vlans.txt
Replace the fixed-count loop (12 × 5s = 60s) in `wait_and_get_recv_data.sh` with a time-bounded loop (~4.5 min), and raise the enclosing `exec` timeout from 5m to 7m.
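For illustration, a deadline-based polling loop of the kind described for `wait_and_get_recv_data.sh` could be sketched as follows. The script's real contents are not shown here. `wait_until`, the polling interval, and the `app1_reachable` probe in the usage comment are assumptions.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): poll "$@" until it succeeds or until
# $1 seconds of wall-clock time have passed, sleeping $2 seconds between polls.
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    deadline=$(( $(date +%s) + timeout ))
    while :; do
        if "$@"; then
            return 0
        fi
        # Give up only once the time budget is spent, not after N attempts.
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1
        fi
        sleep "$interval"
    done
}

# Example: wait up to roughly 4.5 minutes (270 s), polling every 5 s.
# app1_reachable is an assumed caller-supplied probe:
#   wait_until 270 5 app1_reachable || exit 1
```

Keeping the loop budget (about 4.5 min) below the enclosing `exec` timeout (7m) leaves the script room to report its own failure instead of being killed by the harness.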