
Replace NanosSinceEpoch with UniqueTimestamp (HLC) in Bifrost Record.created_at #4516

@tillrohrmann

Context

In #4515, we identified that the vqueue inbox ordering breaks when multiple Bifrost records arrive within the same millisecond. The root cause is that Record.created_at is a NanosSinceEpoch which gets truncated to MillisSinceEpoch in the partition processor, then reconstructed into a UniqueTimestamp with logical clock = 0 via from_unix_millis_unchecked(). This makes same-millisecond entries indistinguishable in the vqueue inbox key ordering.
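The information loss can be reproduced with a small model. The types below are simplified stand-ins written for this illustration (they only borrow the names used above and are not the actual Restate definitions), and `reconstruct` is a hypothetical helper that mimics the truncate-then-rebuild path:

```rust
// Simplified stand-ins for illustration only; not the real Restate types.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct NanosSinceEpoch(u64);

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct MillisSinceEpoch(u64);

/// Simplified HLC: physical milliseconds plus a logical counter.
/// Field order matters: the derived `Ord` compares physical time first.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct UniqueTimestamp {
    physical_millis: u64,
    logical: u32,
}

impl From<NanosSinceEpoch> for MillisSinceEpoch {
    fn from(n: NanosSinceEpoch) -> Self {
        // Sub-millisecond precision is dropped here.
        MillisSinceEpoch(n.0 / 1_000_000)
    }
}

impl UniqueTimestamp {
    /// Models the reconstruction described above: the logical clock is always 0.
    fn from_unix_millis_unchecked(m: MillisSinceEpoch) -> Self {
        UniqueTimestamp { physical_millis: m.0, logical: 0 }
    }
}

/// Hypothetical helper: truncate to milliseconds, then rebuild a timestamp.
fn reconstruct(created_at: NanosSinceEpoch) -> UniqueTimestamp {
    UniqueTimestamp::from_unix_millis_unchecked(created_at.into())
}
```

Two records stamped 100 µs and 900 µs into the same millisecond compare as strictly ordered at nanosecond precision, but `reconstruct` maps both to the same `UniqueTimestamp`, so a key built from it cannot tell them apart.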

The immediate fix (#4515) adds a deterministic HLC-like counter in the state machine that increments the logical clock for same-millisecond records. The HLC state is persisted to the FSM table so that crash recovery produces correctly ordered timestamps. This works correctly, but it is a workaround: the information loss only happens because Record.created_at is typed as NanosSinceEpoch rather than UniqueTimestamp.
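A minimal sketch of such a counter, assuming a simplified `UniqueTimestamp` and a hypothetical `HlcState` struct (neither is the code from #4515). Treating a record whose millisecond is older than the last one seen the same way as a same-millisecond record is an assumption of this sketch:

```rust
// Simplified stand-in for illustration only; not the real Restate type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct UniqueTimestamp {
    physical_millis: u64,
    logical: u32,
}

/// Hypothetical state machine field. Persisting `last` alongside the other
/// state is what lets a replay after a crash produce the same timestamps.
#[derive(Debug, Default)]
struct HlcState {
    last: Option<UniqueTimestamp>,
}

impl HlcState {
    /// Deterministically derives a strictly increasing timestamp from the
    /// millisecond-truncated creation time of the next record.
    fn next(&mut self, record_millis: u64) -> UniqueTimestamp {
        let ts = match self.last {
            // Same (or older) millisecond: keep the physical part and bump
            // the logical clock so the result still sorts after `last`.
            Some(last) if record_millis <= last.physical_millis => UniqueTimestamp {
                physical_millis: last.physical_millis,
                logical: last
                    .logical
                    .checked_add(1)
                    .expect("logical clock overflow is not handled in this sketch"),
            },
            // Physical time advanced: reset the logical clock.
            _ => UniqueTimestamp { physical_millis: record_millis, logical: 0 },
        };
        self.last = Some(ts);
        ts
    }
}
```

Because `next` depends only on the persisted `last` value and the record's own timestamp, restoring `last` and replaying the same records yields identical results.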

Proposal

Change Record.created_at from NanosSinceEpoch to UniqueTimestamp (HLC) so the monotonic ordering guarantee lives at the Bifrost layer. This would:

  1. Eliminate information loss: The state machine would receive a proper UniqueTimestamp directly, no reconstruction needed
  2. Benefit all consumers: Any future consumer of Record ordering gets correct monotonic timestamps for free
  3. Leverage existing infrastructure: The on-disk record format already supports HLC timestamps via RecordFlags::HlcTimestamp, and both log-server and local-loglet decoders handle it
  4. Remove the persisted HLC workaround: The state machine's HLC counter and its FSM table persistence (added in #4515) can be removed

Design considerations

Sequencer-assigned timestamps

Ideally, the created_at HLC timestamp should be assigned by the sequencer rather than by the record producer. Currently, NanosSinceEpoch::now() is called on the producer side (InputRecord::from(Arc<T>)), so each node stamps records with its own wall clock before sending them to the sequencer. This causes timestamp skew: records from nodes with slightly different clocks can arrive at the sequencer in a different order than their timestamps suggest.

If the sequencer assigns the HLC timestamp, it guarantees:

  • No cross-node clock skew: A single clock source determines the ordering
  • Monotonicity aligned with LSN: The sequencer already assigns LSNs, so aligning HLC assignment with LSN assignment ensures the two ordering signals are consistent
  • Simpler producer API: Producers don't need access to an HLC clock

The tradeoff is that created_at would reflect sequencer time rather than producer time, which may slightly affect latency metrics (write-to-read latency). This is likely acceptable since the sequencer is on the critical path anyway.
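To illustrate how timestamp assignment could be aligned with LSN assignment, here is a rough sketch. `Sequencer`, `append_at`, and `append` are hypothetical names invented for this example, LSNs are modelled as a plain `u64`, and `UniqueTimestamp` is again a simplified stand-in rather than the real type:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Simplified stand-in for illustration only; not the real Restate type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct UniqueTimestamp {
    physical_millis: u64,
    logical: u32,
}

/// Hypothetical sequencer that hands out an LSN and an HLC timestamp together.
#[derive(Debug, Default)]
struct Sequencer {
    next_lsn: u64,
    last_ts: Option<UniqueTimestamp>,
}

impl Sequencer {
    /// Assigns the next LSN and a timestamp that is strictly greater than the
    /// previous one, even if the wall clock stalls or moves backwards.
    fn append_at(&mut self, now_millis: u64) -> (u64, UniqueTimestamp) {
        let ts = match self.last_ts {
            Some(last) if now_millis <= last.physical_millis => UniqueTimestamp {
                physical_millis: last.physical_millis,
                logical: last
                    .logical
                    .checked_add(1)
                    .expect("logical clock overflow is not handled in this sketch"),
            },
            _ => UniqueTimestamp { physical_millis: now_millis, logical: 0 },
        };
        self.last_ts = Some(ts);
        self.next_lsn += 1;
        (self.next_lsn, ts)
    }

    /// Same as `append_at`, but reads the sequencer's own wall clock.
    fn append(&mut self) -> (u64, UniqueTimestamp) {
        let now = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_millis() as u64)
            .unwrap_or(0);
        self.append_at(now)
    }
}
```

Since both values are produced in the same step from a single clock source, a higher LSN always carries a higher timestamp in this model, and producers never need to touch a clock.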

Key changes needed

  • Change Record.created_at field type from NanosSinceEpoch to UniqueTimestamp
  • Add an HLC clock to the sequencer (preferred) or Bifrost append path
  • Update InputRecord creation to either omit created_at (sequencer fills it) or use a provisional value
  • Set RecordFlags::HlcTimestamp in on-disk encoders (infrastructure already exists)
  • Handle bilrost wire format compatibility (rolling upgrade)
  • Update latency metrics from nanosecond to millisecond precision (acceptable)
  • Change the state machine's record_created_at to UniqueTimestamp directly
  • Remove the persisted HLC workaround in the state machine (FSM table field LAST_RECORD_UNIQUE_TS, StateMachine.last_record_unique_ts)
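As a rough illustration of the latency-metric item, the sketch below computes write-to-read latency from the physical part of an HLC timestamp. `write_to_read_latency` is a hypothetical helper and `UniqueTimestamp` is a simplified stand-in; the real metric code may look quite different:

```rust
use std::time::Duration;

// Simplified stand-in for illustration only; not the real Restate type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct UniqueTimestamp {
    physical_millis: u64,
    logical: u32,
}

/// Hypothetical helper: latency between record creation and the moment it is
/// read, at millisecond precision. The logical clock carries no wall-clock
/// meaning and is ignored. A reader clock that lags behind the writer is
/// clamped to zero instead of underflowing.
fn write_to_read_latency(created_at: UniqueTimestamp, now_millis: u64) -> Duration {
    Duration::from_millis(now_millis.saturating_sub(created_at.physical_millis))
}
```

Anything below one millisecond is no longer visible with this representation, which is the precision loss the list above describes as acceptable.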
