Skip to content

Fix and register error_message_on_failure_to_fetch_snapshot test#7733

Draft
Copilot wants to merge 6 commits intomainfrom
copilot/add-error-message-test-run
Draft

Fix and register error_message_on_failure_to_fetch_snapshot test#7733
Copilot wants to merge 6 commits intomainfrom
copilot/add-error-message-test-run

Conversation

Copy link
Contributor

Copilot AI commented Mar 11, 2026

  • Rename run_error_message_on_failure_to_fetch_snapshottest_error_message_on_failure_to_fetch_snapshot(network, args) and add to run_backup_snapshot_download (runtime BackupSnapshotFetch failure)
  • Remove standalone schema.py entry
  • Add join_target_interface_name param to _setup_node in network.py (defaults to FILE_SERVING_RPC_INTERFACE, no behaviour change for existing callers)
  • Add test_join_time_snapshot_fetch_failure(network, args) — join-time FetchSnapshot failure
  • Add call to test_join_time_snapshot_fetch_failure in run_backup_snapshot_download
  • Factor shared log-assertion logic into _assert_snapshot_fetch_failure_messages(node, timeout_s=0) helper, used by both tests
Original prompt

This section details on the original issue you should resolve

<issue_title>error_message_on_failure_to_fetch_snapshot test is not run, and currently fails</issue_title>
<issue_description>Test was added in #7317, but not actually called. I renamed it (with other renames) in #7709, and noticed it wasn't being called.</issue_description>

<agent_instructions>Add this test to a sensible runner, near similar tests. Run it to examine the failure, work out what's broken, and fix it. May just be some drift in the error message that's printed? May be something more sinister. Good luck!</agent_instructions>

Comments on the Issue (you are @copilot in this section)


💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

The test was added in #7317 but never added to a runner. The original
logic was also broken: it stopped the primary before joining, so no
StartupSeqnoIsOld response was ever produced and FetchSnapshot was never
triggered - the expected retry/giving-up log messages could never appear.

Redesign: use BackupSnapshotFetch with
backup_snapshot_fetch_target_rpc_interface="primary_rpc_interface".
The primary_rpc_interface lacks the SnapshotRead operator feature, so
every fetch request returns HTTP 404, reliably driving the
fetch_from_peer retry loop to log all three attempt messages and the
"Exceeded maximum snapshot fetch retries … giving up" message.

Also:
- Add args.label suffix to avoid workspace collision
- Inline txs construction (matches run_backup_snapshot_download style)
- Use network.txs.issue instead of a separate txs variable
- Collect matched patterns before removing to avoid list mutation during
  iteration
- Register the test in schema.py alongside run_backup_snapshot_download

Co-authored-by: eddyashton <6000239+eddyashton@users.noreply.github.com>
Copilot AI changed the title [WIP] Add error message test for snapshot fetching Fix and register error_message_on_failure_to_fetch_snapshot test Mar 11, 2026
@eddyashton
Copy link
Member

@copilot Rather than adding a separate runner and network just for this single test, can we rename it to a test_ function, put it in an existing run, and simply create a new joiner pointed at the wrong interface to produce the desired errors?

Per reviewer feedback: rename run_error_message_on_failure_to_fetch_snapshot
to test_error_message_on_failure_to_fetch_snapshot(network, args) and add
it to the existing run_backup_snapshot_download run alongside
test_backup_snapshot_fetch and test_backup_snapshot_fetch_max_size, rather
than having a separate runner and network.

Also remove the now-unnecessary standalone schema.py cr.add entry.

Co-authored-by: eddyashton <6000239+eddyashton@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Mar 11, 2026

@copilot Rather than adding a separate runner and network just for this single test, can we rename it to a test_ function, put it in an existing run, and simply create a new joiner...

Done in 5004695. Renamed to test_error_message_on_failure_to_fetch_snapshot(network, args), added it to run_backup_snapshot_download alongside the other two snapshot-fetch tests, and dropped the separate runner and schema.py entry.

@eddyashton
Copy link
Member

@copilot There's 2 kinds of "backup nodes fetching snapshots from primaries" - during join, and gossiped during runtime. Your changes have altered this test to find this error during the second phase (BackupSnapshotFetch). That's a useful test worth adding to, but we still do not have coverage confirming these errors can be shown when the initial join fetch fails.

… coverage

Adds coverage for the initial join-time FetchSnapshot failure path, which
the reviewer identified as missing.

- Add join_target_interface_name parameter to _setup_node in network.py
  (defaults to FILE_SERVING_RPC_INTERFACE, preserving existing behaviour)
  so tests can direct the join target to a specific interface.

- Add test_join_time_snapshot_fetch_failure(network, args):
  1. Adds an intermediate node joined from snapshot (startup_seqno > 0).
  2. Joins a fresh node (no snapshot) targeting that intermediate node
     via primary_rpc_interface (lacks SnapshotRead).  The intermediate
     node returns StartupSeqnoIsOld (its startup_seqno > 0), triggering
     the FetchSnapshot task.
  3. FetchSnapshot GETs /node/snapshot on primary_rpc_interface → HTTP 404
     on every attempt, exhausting the 3-attempt retry loop and logging
     "Exceeded maximum snapshot fetch retries (3), giving up".
  4. Asserts all expected retry and give-up log messages appear.

- Add call in run_backup_snapshot_download before the existing
  test_error_message_on_failure_to_fetch_snapshot (BackupSnapshotFetch).

Co-authored-by: eddyashton <6000239+eddyashton@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Mar 11, 2026

@copilot There's 2 kinds of "backup nodes fetching snapshots from primaries" - during join, and gossiped during runtime. Your changes have altered this test to find this error during...

Added in 21b84d8. test_join_time_snapshot_fetch_failure adds an intermediate node from snapshot (startup_seqno > 0), then joins a fresh node targeting that intermediate via primary_rpc_interface. The intermediate returns StartupSeqnoIsOld, triggering the join-time FetchSnapshot task, which exhausts its 3 attempts against the 404 endpoint (no SnapshotRead) and logs "Exceeded maximum snapshot fetch retries". _setup_node in network.py gets a new join_target_interface_name parameter (defaulting to FILE_SERVING_RPC_INTERFACE) to enable this.

@eddyashton
Copy link
Member

@copilot The setup and shutdown code around these tests might be different, but I feel like the core assertion logic (scanning the logs for the expected messages) should be identical. Can you factor that out into a separate function, called by both tests?

…_messages

Both test_join_time_snapshot_fetch_failure and test_error_message_on_failure_to_fetch_snapshot
used identical expected_patterns lists and near-identical log-scanning loops.
Factored into _assert_snapshot_fetch_failure_messages(node, timeout_s=0):
- timeout_s=0 (default): single pass, for already-stopped nodes
- positive timeout_s: polls until messages appear, for still-running nodes

Co-authored-by: eddyashton <6000239+eddyashton@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Mar 12, 2026

@copilot The setup and shutdown code around these tests might be different, but I feel like the core assertion logic (scanning the logs for the expected messages) should be identical...

Done in f57db9c. Factored into _assert_snapshot_fetch_failure_messages(node, timeout_s=0) — single pass (default timeout_s=0) for the already-stopped join-time node, and timeout_s=30 for the still-running runtime node.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

error_message_on_failure_to_fetch_snapshot test is not run, and currently fails

2 participants