Skip to content

Time based snapshotting#7731

Draft
cjen1-msft wants to merge 31 commits intomicrosoft:mainfrom
cjen1-msft:time-base-snapshotting
Draft

Time based snapshotting#7731
cjen1-msft wants to merge 31 commits intomicrosoft:mainfrom
cjen1-msft:time-base-snapshotting

Conversation

@cjen1-msft
Copy link
Contributor

@cjen1-msft cjen1-msft commented Mar 10, 2026

Adds time-based snapshotting.

This is configured via snapshot.min_tx_count and snapshot.time_interval.

The idea is that if time_interval and min_tx_count are exceeded we should emit a snapshot.

time_interval is reasonably self-evident, but min_tx_count allows us to avoid emitting new snapshots when no user transactions occur.
In an idle service, a snapshot requires two transactions.
One for snapshot evidence, and one for the signature for that evidence.
Hence whenever there are more than two transactions since the last signature, then there needs to be a new snapshot.

UPDATE

This seems to be largely happy and correct, but the TSAN run fails due to #7739 which is pre-existing.

@cjen1-msft cjen1-msft changed the title Time base snapshotting Time based snapshotting Mar 10, 2026
@cjen1-msft cjen1-msft marked this pull request as ready for review March 10, 2026 19:22
@cjen1-msft cjen1-msft requested a review from a team as a code owner March 10, 2026 19:22
Copilot AI review requested due to automatic review settings March 10, 2026 19:22
@cjen1-msft cjen1-msft added the run-long-test Run Long Test job label Mar 10, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds time-based snapshotting to CCF, controlled by snapshots.time_interval and snapshots.min_tx_count, to trigger snapshots after a time threshold while avoiding repeated snapshots when the service is idle.

Changes:

  • Extends snapshot scheduling logic to include time-based eligibility, and emits signatures when needed to complete snapshot chunking.
  • Introduces a replicated snapshot_status internal table to record snapshot scheduling metadata.
  • Updates startup/config parsing, host config schema, docs, and test infra CLI/config plumbing for the new snapshot settings.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
tests/infra/network.py Forwards new snapshot CLI args into node configuration.
tests/infra/e2e_args.py Adds CLI switches for --snapshot-min-tx-interval and --snapshot-time-interval.
tests/e2e_operations.py Adds an e2e test function intended to exercise time-based snapshotting.
tests/config.jinja Renders new snapshot fields into node startup JSON.
src/service/tables/snapshot_status.h Adds new internal snapshot_status ServiceValue type/table name.
src/node/snapshotter.h Implements time-based snapshot scheduling state + writes snapshot_status.
src/node/node_state.h Installs a map hook to ingest replicated snapshot_status on backups; wires new snapshotter ctor args.
src/node/history.h Emits signatures when a snapshot should be scheduled even without other committable txs.
src/kv/store.h Adds should_schedule_snapshot() and changes chunking decision logic around snapshots.
src/kv/kv_types.h Extends snapshotter interface with should_schedule_snapshot.
src/common/configuration.h Adds JSON optional fields for new snapshot settings.
include/ccf/node/startup_config.h Adds snapshots.min_tx_count and snapshots.time_interval to config struct.
doc/operations/ledger_snapshot.rst Documents time-based snapshotting and warns about low min_tx_count.
doc/host_config_schema/cchost_config.json Adds schema entries for min_tx_count and time_interval.
doc/audit/builtin_maps.rst Documents the new snapshot_status builtin map.
Comments suppressed due to low confidence (1)

src/kv/store.h:1082

  • should_create_ledger_chunk_unsafe() no longer considers StoreFlag::SNAPSHOT_AT_NEXT_SIGNATURE. This changes behaviour when a snapshot is requested via tx flag/store flag but no snapshotter is configured, and it breaks the existing expectation that should_create_ledger_chunk() returns true immediately after setting this flag (see src/kv/test/kv_test.cpp "Chunk when the snapshotter requires one"). Consider restoring r = flag_enabled_unsafe(StoreFlag::SNAPSHOT_AT_NEXT_SIGNATURE) (or equivalent) so chunking is still forced when the snapshot flag is set, independent of snapshotter presence.
      bool r = false;

      if (chunker)
      {
        r |= chunker->is_chunk_end_requested(version);
      }

      if (snapshotter)
      {
        r |= snapshotter->record_committable(version);
      }

ccf::kv::AbstractStore::StoreFlag::SNAPSHOT_AT_NEXT_SIGNATURE);

REQUIRE_FALSE(record_signature(history, snapshotter, snapshot_idx));
REQUIRE(record_signature(history, snapshotter, snapshot_idx));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure why this was REQUIRE_FALSE originally.

The logic around this is that record_committable for a signature previously only returned true if the snapshot was due and not forced.
However now it returns true iff a snapshot was requested.

The argument is that there is no backlinks from the should_schedule_snapshot to the tx or store locks, so it only goes signature_lock -> snapshotter_lock
@cjen1-msft cjen1-msft removed the run-long-test Run Long Test job label Mar 12, 2026
@cjen1-msft cjen1-msft added the run-long-test Run Long Test job label Mar 13, 2026
"min_tx_count": {
"type": "integer",
"default": 2,
"description": "Minimum number of transactions that must have elapsed since the last snapshot before a time-based snapshot can be triggered. Has no effect if time_interval is not set. If lower than 2 while time_interval is enabled, snapshots will be triggered indefinitely."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should allow values that result in indefinite triggering behaviour.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is up for debate.
It is somewhat reasonable that you might want it to generate a new snapshot every day, no matter what happens on the ledger itself: for example if you have don't have good replication of the snapshots throughout the cluster (such that there is always one which is about a day old)

@cjen1-msft cjen1-msft marked this pull request as draft March 13, 2026 19:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

run-long-test Run Long Test job

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants