[16.0-stable] backport test robustness fixes from master #1157
Merged
eriknordmark merged 17 commits into lf-edge:EVE-16.0-stable on May 8, 2026
…order
Replace the iteration-bounded seq/sleep loop in ssh.sh with a time-bounded while loop (18 min, explicit exit 1 on expiry) to reliably fill the exec -t 20m budget.
Restore TestInfo as a background process started before ssh.sh. Eden's InfoChecker uses InfoNew mode (new messages only), so TestInfo must be subscribed before the curl POST fires to capture the immediate ZInfoMsg EVE sends on AppInstMetadata receipt. Running TestInfo after ssh.sh meant the subscription was registered after that message had already arrived at Adam; the next info comes from the periodic timer (10m ± 20% jitter), well outside any reasonable timewait.
Raise the TestInfo timewait to 20m to cover the full 18m ssh.sh window plus the periodic timer as a fallback.
Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit f7396c7)
…sshd readiness check
Replace iteration-bounded seq/sleep SSH retry loops in eclient.txt and userdata.txt with time-bounded while loops. The old loops ran for at most seq*sleep seconds and fell through silently; the new loops run until 30s before the exec -t deadline and exit 1 explicitly when no connection was made, making failures visible rather than letting the test continue with an ambiguous result.
In log_test.txt, add a foreground wait_ssh.sh step (5m timeout) that polls `eden eve ssh` until EVE's sshd accepts a connection before the background ssh.sh starts generating log entries. log_test runs immediately after eve_restart in the smoke suite; the restart confirms EVE is up via Adam registration, but sshd takes additional time to become operational, causing all SSH attempts to fail with connection timeouts/resets and TestLog to time out waiting for "Disconnected" entries.
Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit d976210)
nginx was deployed without port forwarding, so `eden pod ps` always returned "-" in the INTERNAL column, making server_ip.sh set ESERVER_IP="-" and causing `curl -` to fail.
Fix by adding -p 8080:80 to the nginx deployment so its internal IP appears in pod ps. Also replace the single-shot stale pod_ps file read in server_ip.sh with a live retry loop (up to 30x10s), matching the same race condition fix applied to networking_light.txt in f1902ce.
Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit 57a1154)
After an EVE restart the first log-upload batch can be delayed 10+ minutes while startup logs queue up. The previous ssh.sh finished in ~2.5 minutes, leaving no new "Disconnected" entries being generated when the delivery pipeline eventually recovered. The 10m timewait expired 29 seconds before the backlog cleared in CI run 24686747135.
Three changes:
- timewait 10m → 15m: enough buffer for the delayed first-batch upload to deliver the "Disconnected" messages.
- ssh.sh: replace the 15-iteration finite loop with a 14-minute time-bounded loop so connections are generated throughout the test window. If the pipeline recovers at any point, the most recent connection's log arrives within seconds.
- exec -t 5m → 16m: raise the background exec timeout to match the new loop duration. Add trap 'exit 0' TERM INT so the process exits cleanly when testscript sends SIGINT at script end.
Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit 0ff1c51)
After an EVE reboot the log delivery pipeline can be completely silent for 15+ minutes while startup logs drain. The previous fix (15m timewait + continuous SSH connections) still times out because the SSH disconnect entries are buried behind the startup backlog and never surface within the window.
Switch to a two-phase approach:
- Phase 1 (25m timewait): wait for ANY log entry to prove the pipeline is delivering fresh entries to the controller.
- Phase 2 (7m timewait): only then start generating SSH connections and look for the "Disconnected" message.
This ensures the Disconnected search never starts on a broken pipeline.
Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit 5a4feb0)
Under CI load, app1's VLAN networking can take longer than 1 minute to become reachable. The original wait_and_get_recv_data.sh used a fixed count loop (12 × 5s = 60s) which was too short. Replace with a time-bounded loop (~4.5 min) and raise the enclosing exec timeout from 5m to 7m to match.
Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit 5c6ebf7)
…vated
Regression test for EVE commit a1582bb40 ("zedmanager: fix purge stuck
when app was never activated").
The test deploys an app with a deliberately invalid image tag so that
the download fails and no domain is ever created. It then atomically
fixes the image URL, generates fresh ContentTree and Volume UUIDs (as
"eden pod purge" does), and increments the purge counter. The test
verifies the app reaches RUNNING state; without the EVE fix the app
would get stuck at LOADED with VerifyOnly=true indefinitely.
The test is added to tests/workflow/user-apps.tests.txt as entry 9,
after the nodered test.
Signed-off-by: eriknordmark <erik@zededa.com>
(cherry picked from commit 2fdadd9)
When a pod has a port mapping (e.g. -p 8080:80), eden pod ps reports the INTERNAL column as "IP:PORT". With an unassigned IP the value is "-:80", not just "-", so the existing [ "$IP" != "-" ] guard was bypassed and the placeholder "-:80" was stored as ESERVER_IP, causing curl to fail with "missing URL before --next".
Strip the port suffix with cut -d: -f1 before the validity check so "-:80" reduces to "-" and triggers the retry loop as intended.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Renê de Souza Pinto <rene@renesp.com.br>
(cherry picked from commit a807109)
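The guard described above can be illustrated in Go, with the caveat that the real fix is a shell one-liner using cut. `internalIP` is a hypothetical helper, and the sketch assumes IPv4 addresses, since an IPv6 address contains colons of its own.

```go
package main

import (
	"fmt"
	"strings"
)

// internalIP mimics `cut -d: -f1` on the INTERNAL column value: it keeps
// everything before the first colon, then reports whether a usable address
// remains. "-" is the placeholder shown while no IP has been assigned.
func internalIP(column string) (string, bool) {
	ip, _, _ := strings.Cut(column, ":")
	return ip, ip != "" && ip != "-"
}

func main() {
	// The address 10.11.12.2 is an arbitrary example value.
	for _, col := range []string{"-", "-:80", "10.11.12.2:80"} {
		ip, ok := internalIP(col)
		fmt.Printf("%q -> ip=%q usable=%v\n", col, ip, ok)
	}
}
```

Checking the placeholder after stripping the port means both "-" and "-:80" are treated as "not ready yet", so a caller can keep retrying.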
mkquery() used strings.Split(f, ":") and took only s[1], silently dropping everything after the second colon in a query argument. For patterns like 'content:Rebuilding...reasons:.updating.app.connectivity' this meant the specific reason string was never matched, causing the lim test to match any log entry with "Rebuilding intended global config, reasons:" regardless of the actual reason, including stale logs from earlier in the test run.
Fix by using strings.SplitN(f, ":", 2) so the full value is preserved.
Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit ad20c3e)
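A minimal sketch of the two behaviours, using the pattern quoted in the commit message. `valueWithSplit` and `valueWithSplitN` are hypothetical helpers written for this comparison; they are not the body of mkquery().

```go
package main

import (
	"fmt"
	"strings"
)

// valueWithSplit reproduces the old behaviour: split on every colon and
// keep only the second field, losing anything after the second colon.
func valueWithSplit(f string) string {
	s := strings.Split(f, ":")
	if len(s) < 2 {
		return ""
	}
	return s[1]
}

// valueWithSplitN splits into at most two parts, so the value keeps any
// further colons it contains.
func valueWithSplitN(f string) string {
	s := strings.SplitN(f, ":", 2)
	if len(s) < 2 {
		return ""
	}
	return s[1]
}

func main() {
	f := "content:Rebuilding...reasons:.updating.app.connectivity"
	fmt.Println(valueWithSplit(f))  // Rebuilding...reasons
	fmt.Println(valueWithSplitN(f)) // Rebuilding...reasons:.updating.app.connectivity
}
```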
The volumes_test creates blank-vol-1 with half the persist total, then expects blank-vol-2 (also half total) to be rejected due to insufficient space. This assumption breaks on runners with larger ZFS pools, or when ZFS uses thin-provisioned datasets (no reservation): if the remaining space after blank-vol-1 equals or exceeds half_total, EVE's space check passes and blank-vol-2 is silently created, so the expected volumeErr never arrives and the 10-minute watcher times out.
Fix by:
- Renaming wait-and-get-half-total.sh to wait-and-get-persist-fractions.sh and exporting three_quarter_total (3/4 of persist) in addition to half_total.
- Using three_quarter_total for blank-vol-1. With blank-vol-1 at 75%, only 25% of the pool remains, which is always less than blank-vol-2's 50%, regardless of pool size or ZFS overhead. After deleting blank-vol-1 (75% freed), blank-vol-2 at 50% fits in the reclaimed space, so the rest of the test proceeds unchanged.
- Increasing the errorwait timewait from 10m to 20m to absorb any remaining delay in EVE's error-detection and info-delivery path.
Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit c76eb7e)
The test was flaky on TPM-enabled devices because the 15 s sleep after setting remote.loglevel=info was not long enough for EVE to apply the config change. On TPM-enabled devices EVE must verify the controller's signed config before applying it; combined with the polling interval this can take several minutes, causing Phase 2's 7 m "Disconnected" detection window to open before INFO-level logging was active.
Fix:
- Wait for sshd availability (wait_ssh.sh) before starting the detection window, since sshd readiness confirms EVE has fully initialised.
- After wait_ssh.sh succeeds, sleep 60 s so that EVE has time to fetch and apply the updated remote log level (including TPM verification).
- Extend the "Disconnected" detection window from 7 m to 25 m and run the ssh connection loop for the full window. The 25 m budget absorbs any remaining propagation delay and gives ample time for sshd INFO logs to arrive at the controller.
Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit 1cc809d)
Backport of lf-edge#1146 (commit c42bb5e).
The fledge-iot.org website and package archive were decommissioned; .deb packages are no longer available. The Dockerfile must be rewritten to clone and build all fledge components directly from https://github.com/fledge-iot.
Manual port: the original commit was a 60-line→180-line rewrite that does not cherry-pick cleanly because the stable branch had ubuntu:18.04 / FLEDGEVERSION 1.8.2 while the new version is ubuntu:20.04 / FLEDGEVERSION 3.1.0. Adopt master's Dockerfile verbatim, since the old one no longer builds anywhere now that fledge-iot.org is gone.
Original commit message follows:
The fledge-iot.org website and package archive were decommissioned; .deb packages are no longer available. Rewrite the Dockerfile to clone and build all fledge components directly from https://github.com/fledge-iot.
Key changes:
- Base image bumped from ubuntu:18.04 (EOL) to ubuntu:20.04
- FLEDGEVERSION defaults to 3.1.0 (latest stable)
- fledge core built via requirements.sh + make install
- C++ plugins built with cmake -DFLEDGE_INSTALL
- Python plugins (sinusoid, http_south) installed by file copy
- fledge-gui built with yarn + ./build --clean-start, served by nginx
Package renames from deb to GitHub repo names:
- fledge-south-modbus -> fledge-south-modbus-c
- fledge-south-flirax8 -> fledge-south-FlirAX8
- fledge-south-http-south -> fledge-south-http
- fledge-north-httpc -> fledge-north-http-c
- fledge-filter-flirvalidity removed (no public GitHub repo)
Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Eden commit c76eb7e sized blank-vol-1 to 3/4 of the persist total to guarantee blank-vol-2 (half_total) cannot fit alongside it. On ZFS the v-* volumes that precede blank-vol-1 already consume some of the pool, so 3/4 of the original total can exceed actual free space; blank-vol-1 then stalls in INITIAL and TestVolStatus times out at 10m. ext4 is unaffected because file-backed volumes are sparse.
Compute blank-vol-1's size dynamically from free space at the moment the script runs: half_total stays half the pool (still the size blank-vol-2 expects to fail at) and blank-vol-1 takes free - half_total + 200 MB so the remaining space is just below half_total regardless of ZFS reservation overhead.
Require at least one SAFETY_MARGIN_MB of headroom over half_total before proceeding; this guarantees blank_vol_1_mb is at least 2 * SAFETY_MARGIN_MB and gives the create real breathing room for zedmanager and ZFS overhead. Bail out with a clear, actionable message naming the threshold when free space is too tight, since the test is not meaningful in that case.
Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit 5a4e8ad)
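The sizing rule can be worked through in a short Go sketch. The real logic is in a shell script; `blankVol1SizeMB` is a hypothetical function, and the sketch assumes that SAFETY_MARGIN_MB equals the 200 MB named in the commit message, which is consistent with the stated "at least 2 * SAFETY_MARGIN_MB" guarantee but is not confirmed by the source.

```go
package main

import "fmt"

// safetyMarginMB is assumed to be the 200 MB mentioned in the commit message.
const safetyMarginMB = 200

// blankVol1SizeMB returns a size for blank-vol-1 such that the space left
// afterwards is halfTotal - safetyMarginMB, i.e. just too small for a
// second volume of halfTotal. It fails when free space is too tight.
func blankVol1SizeMB(totalMB, freeMB int64) (int64, error) {
	halfTotal := totalMB / 2
	threshold := halfTotal + safetyMarginMB
	if freeMB < threshold {
		return 0, fmt.Errorf("free space %d MB is below the required %d MB", freeMB, threshold)
	}
	return freeMB - halfTotal + safetyMarginMB, nil
}

func main() {
	// Example numbers only: a 10000 MB pool with 9000 MB currently free.
	size, err := blankVol1SizeMB(10000, 9000)
	if err != nil {
		fmt.Println("cannot run test:", err)
		return
	}
	remaining := int64(9000) - size
	fmt.Println("blank-vol-1 MB:", size, "remaining MB:", remaining)
}
```

With these example numbers half_total is 5000 MB, blank-vol-1 is 9000 - 5000 + 200 = 4200 MB, and 4800 MB remain, which is below the 5000 MB that blank-vol-2 requests.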
…name
The ctrl_cert_change smoke test was timing out waiting for the log message "Rebuilding intended global config, reasons: reconnecting app", which no longer exists. When ReconnectApp() in nireconciler/linux_reconciler.go was renamed to UpdateAppConn() as part of the Kubernetes CNI plugin work, the reason string passed to scheduleGlobalCfgRebuild() changed from "reconnecting app (%v)" to "updating app connectivity (%v)".
The mechanism itself still works correctly: a controller cert change invalidates cipher contexts, which triggers AppNetworkConfig re-publication, which causes doUpdateActivatedAppNetwork to call UpdateAppConn in the NI reconciler, which logs the updated reason string.
Update the LIM test pattern to match the current reason string.
Signed-off-by: Renê de Souza Pinto <rene@renesp.com.br>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit 12105b9)
The added test sets the remote log levels of EVE and checks the logs that arrive at the controller after a certain period of time. The test succeeds unless logs below the set remote log level are detected.
Signed-off-by: Paul Gaiduk <paulg@zededa.com>
(cherry picked from commit 18508e6)
Root cause
----------
TestLogLevelsDifferent (eden_newlogd smoke test) was flaky with a false failure: "Logs found, but they should not be present with the remote log level 'none'".
The test flow is:
1. Set remote.loglevel=none on EVE.
2. WaitForConfigApplied polls EVE via SSH (stat -c %Y /persist/checkpoint) until the checkpoint file changes.
3. Each SSH poll session that disconnects writes a "Received disconnect" entry into EVE's sshd log; at this point the new log level is not yet applied on EVE, so newlogd persists those entries to disk.
4. Reboot EVE.
5. After reboot, newlogd replays all persisted pre-reboot logs to Adam.
6. FindLogOnAdam(LogNew) calls XRead($) on the Redis stream, blocking for the next entry after the current stream head. The replayed log from step 3 arrives in Redis just after XRead($) starts → the watcher sees it as a new, post-config-change log → test fails.
Observed in CI (Smoke ext4,false):
  content: Received disconnect from 192.168.0.2 port 56928
  timestamp: 2026-04-15 12:33:48 (generated during WaitForConfigApplied)
  arrived in Adam at 12:36:20 (replayed by newlogd post-reboot)
  XRead($) started at 12:36:19 → race lost by 1 second
Fix
---
* Add EdenFindLogsAfter (openevec/eden.go) and FindLogOnAdamAfter (evetestkit/utils.go): same as the existing Find* pair but the match handler returns false (skip, keep watching) for any log entry whose Timestamp is before the supplied 'after' time. Only entries at or after that timestamp trigger a match.
* In TestLogLevelsDifferent, record configAppliedAt := time.Now() immediately after WaitForConfigApplied returns (the moment EVE has confirmed the new log level), then pass it to FindLogOnAdamAfter. Pre-reboot logs replayed by newlogd are silently ignored; only a log generated *after* the new config was applied counts as a violation.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Renê de Souza Pinto <rene@renesp.com.br>
(cherry picked from commit 0bd239a)
The test was failing with a false positive: logs appearing at Adam within seconds of the reboot completing, even though remote.loglevel was set to none.
The root cause is a timing gap. WaitForConfigApplied returns as soon as /persist/checkpoint changes (~11:58:05). The reboot fires a couple of seconds later (~11:58:07). During that brief window EVE is still running and generating log messages; newlogd persists them to disk. After the reboot, newlogd replays those persisted messages to Adam immediately on reconnect. Because those logs carry timestamps from the 11:58:05-11:58:07 window (after configAppliedAt), FindLogOnAdamAfter did not filter them out, causing a spurious failure.
Fix: drop configAppliedAt and instead record rebootedAt := time.Now() right after EveRebootAndWait returns. Pre-reboot log messages always carry timestamps from before the reboot (well before rebootedAt), so FindLogOnAdamAfter correctly skips them. Only logs that EVE generates after coming back up with remote.loglevel=none in effect will have timestamps at or after rebootedAt; those are the genuine violations the test is designed to catch.
Signed-off-by: eriknordmark <erik@zededa.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
(cherry picked from commit 395e59f)
europaul (Contributor) approved these changes on May 8, 2026:
Seeing that all tests are green on the first try - LGTM :)
Backport of test robustness fixes and a regression test from master, plus the newlogd test suite (which 16.0 supports; see 8411d8701 in EVE).
PRs included
- ReconnectApp→UpdateAppConn rename (the EVE-side rename is on 12.0+)
- purge_not_activated_test regression test (regression was introduced after 16.0, so the test serves as a regression-prevention guard here)
- blank-vol-1 from free space
- 18508e6 (test addition) + fix: skip pre-reboot buffered logs in TestLogLevelsDifferent #1142 (skip pre-reboot buffered logs) + tests/newlogd: fix flaky TestLogLevelsDifferent #1147 (fix flaky TestLogLevelsDifferent)
Not backported
Conflict resolutions
- tests/eclient/testdata/userdata.txt (eclient,log_test,userdata: ssh robustness improvements #1143 d976210): existing 60s seq 30 / sleep 2 retry on stable replaced by master's 4m30s time-bounded loop with explicit exit 1 (and exec -t 5m). Master version strictly better.
- tests/volume/testdata/volumes_test.txt (tests/lim,volume: fix flaky log and storage tests #1151 c76eb7e): half_total → three_quarter_total. Took master version; the accompanying wait-and-get-persist-fractions.sh script applied cleanly because the surrounding context matched.
- tests/workflow/smoke.tests.txt (newlogd test addition 18508e6): master at the time of this commit had the HW inventory test (master only). Resolved by keeping HEAD's structure (no HW inventory) and inserting the new "Eden Newlogd test" between Metric and Vector, renumbering subsequent tests; total bumped from 23 to 24.
Manual port: #1146 fledge
The c42bb5e cherry-pick does not apply cleanly because the stable branch had ubuntu:18.04 / FLEDGEVERSION 1.8.2 while master is now ubuntu:20.04 / FLEDGEVERSION 3.1.0. Adopted master's Dockerfile verbatim, since the old Dockerfile no longer builds anywhere now that fledge-iot.org is decommissioned.