Fix and register error_message_on_failure_to_fetch_snapshot test#7733
Fix and register error_message_on_failure_to_fetch_snapshot test#7733
error_message_on_failure_to_fetch_snapshot test#7733Conversation
The test was added in #7317 but never added to a runner. The original logic was also broken: it stopped the primary before joining, so no StartupSeqnoIsOld response was ever produced and FetchSnapshot was never triggered - the expected retry/giving-up log messages could never appear. Redesign: use BackupSnapshotFetch with backup_snapshot_fetch_target_rpc_interface="primary_rpc_interface". The primary_rpc_interface lacks the SnapshotRead operator feature, so every fetch request returns HTTP 404, reliably driving the fetch_from_peer retry loop to log all three attempt messages and the "Exceeded maximum snapshot fetch retries … giving up" message. Also: - Add args.label suffix to avoid workspace collision - Inline txs construction (matches run_backup_snapshot_download style) - Use network.txs.issue instead of a separate txs variable - Collect matched patterns before removing to avoid list mutation during iteration - Register the test in schema.py alongside run_backup_snapshot_download Co-authored-by: eddyashton <6000239+eddyashton@users.noreply.github.com>
error_message_on_failure_to_fetch_snapshot test
|
@copilot Rather than adding a separate runner and network just for this single test, can we rename it to a |
Per reviewer feedback: rename run_error_message_on_failure_to_fetch_snapshot to test_error_message_on_failure_to_fetch_snapshot(network, args) and add it to the existing run_backup_snapshot_download run alongside test_backup_snapshot_fetch and test_backup_snapshot_fetch_max_size, rather than having a separate runner and network. Also remove the now-unnecessary standalone schema.py cr.add entry. Co-authored-by: eddyashton <6000239+eddyashton@users.noreply.github.com>
Done in 5004695. Renamed to |
|
@copilot There's 2 kinds of "backup nodes fetching snapshots from primaries" - during join, and gossiped during runtime. Your changes have altered this test to find this error during the second phase ( |
… coverage
Adds coverage for the initial join-time FetchSnapshot failure path, which
the reviewer identified as missing.
- Add join_target_interface_name parameter to _setup_node in network.py
(defaults to FILE_SERVING_RPC_INTERFACE, preserving existing behaviour)
so tests can direct the join target to a specific interface.
- Add test_join_time_snapshot_fetch_failure(network, args):
1. Adds an intermediate node joined from snapshot (startup_seqno > 0).
2. Joins a fresh node (no snapshot) targeting that intermediate node
via primary_rpc_interface (lacks SnapshotRead). The intermediate
node returns StartupSeqnoIsOld (its startup_seqno > 0), triggering
the FetchSnapshot task.
3. FetchSnapshot GETs /node/snapshot on primary_rpc_interface → HTTP 404
on every attempt, exhausting the 3-attempt retry loop and logging
"Exceeded maximum snapshot fetch retries (3), giving up".
4. Asserts all expected retry and give-up log messages appear.
- Add call in run_backup_snapshot_download before the existing
test_error_message_on_failure_to_fetch_snapshot (BackupSnapshotFetch).
Co-authored-by: eddyashton <6000239+eddyashton@users.noreply.github.com>
Added in 21b84d8. |
|
@copilot The setup and shutdown code around these tests might be different, but I feel like the core assertion logic (scanning the logs for the expected messages) should be identical. Can you factor that out into a separate function, called by both tests? |
…_messages Both test_join_time_snapshot_fetch_failure and test_error_message_on_failure_to_fetch_snapshot used identical expected_patterns lists and near-identical log-scanning loops. Factored into _assert_snapshot_fetch_failure_messages(node, timeout_s=0): - timeout_s=0 (default): single pass, for already-stopped nodes - positive timeout_s: polls until messages appear, for still-running nodes Co-authored-by: eddyashton <6000239+eddyashton@users.noreply.github.com>
Done in f57db9c. Factored into |
run_error_message_on_failure_to_fetch_snapshot→test_error_message_on_failure_to_fetch_snapshot(network, args)and add torun_backup_snapshot_download(runtimeBackupSnapshotFetchfailure)schema.pyentryjoin_target_interface_nameparam to_setup_nodeinnetwork.py(defaults toFILE_SERVING_RPC_INTERFACE, no behaviour change for existing callers)test_join_time_snapshot_fetch_failure(network, args)— join-timeFetchSnapshotfailuretest_join_time_snapshot_fetch_failureinrun_backup_snapshot_download_assert_snapshot_fetch_failure_messages(node, timeout_s=0)helper, used by both testsOriginal prompt
error_message_on_failure_to_fetch_snapshottest is not run, and currently fails #7715💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.