Description
Context
In #4515, we identified that the vqueue inbox ordering breaks when multiple Bifrost records arrive within the same millisecond. The root cause is that Record.created_at is a NanosSinceEpoch which gets truncated to MillisSinceEpoch in the partition processor, then reconstructed into a UniqueTimestamp with logical clock = 0 via from_unix_millis_unchecked(). This makes same-millisecond entries indistinguishable in the vqueue inbox key ordering.
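The information loss can be illustrated with a small sketch. The types below (Nanos, UniqueTs) and the reconstruct function are hypothetical stand-ins, not the real NanosSinceEpoch, UniqueTimestamp, or from_unix_millis_unchecked(); they only mimic the truncate-then-rebuild path described above.

```rust
/// Hypothetical stand-in for a nanosecond wall-clock timestamp.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Nanos(u64);

/// Hypothetical stand-in for an HLC timestamp: physical part first, then the
/// logical counter, so the derived ordering compares them in that order.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct UniqueTs {
    physical_millis: u64,
    logical: u32,
}

/// Mimics the lossy path: truncate nanoseconds to milliseconds, then rebuild
/// a timestamp whose logical clock is always 0.
fn reconstruct(created_at: Nanos) -> UniqueTs {
    UniqueTs {
        physical_millis: created_at.0 / 1_000_000,
        logical: 0,
    }
}
```

Two records created 0.8 ms apart within the same millisecond compare as distinct before truncation but map to the same reconstructed timestamp, so an ordering key built from it cannot tell them apart.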
The immediate fix (#4515) adds a deterministic HLC-like counter in the state machine that increments the logical clock for same-millisecond records. The HLC state is persisted to the FSM table so that crash recovery produces correctly ordered timestamps. This works correctly but is a workaround — the information loss happens because Record.created_at is typed as NanosSinceEpoch rather than UniqueTimestamp.
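A minimal sketch of what such a deterministic counter might look like follows. This is not the code from #4515: HlcState, UniqueTs, and next are hypothetical names, and treating a record with an older millisecond the same as a same-millisecond record is an assumption made here to keep the output monotonic.

```rust
/// Hypothetical stand-in for an HLC timestamp (physical part, then logical).
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct UniqueTs {
    physical_millis: u64,
    logical: u32,
}

/// Hypothetical state that would be persisted so that replay after a crash
/// yields the same timestamps for the same sequence of records.
#[derive(Default)]
struct HlcState {
    last: Option<UniqueTs>,
}

impl HlcState {
    /// Returns a timestamp strictly greater than the previous one.
    /// Depends only on the input and the stored state, never on the local clock.
    fn next(&mut self, record_millis: u64) -> UniqueTs {
        let ts = match self.last {
            // Same (or, by assumption, older) millisecond: bump the logical clock.
            // Overflow of the logical counter is ignored in this sketch.
            Some(last) if record_millis <= last.physical_millis => UniqueTs {
                physical_millis: last.physical_millis,
                logical: last.logical + 1,
            },
            // New millisecond: reset the logical clock.
            _ => UniqueTs {
                physical_millis: record_millis,
                logical: 0,
            },
        };
        self.last = Some(ts);
        ts
    }
}
```

Because the result is a pure function of the record's millisecond value and the stored state, restoring the state from durable storage and replaying the log reproduces the same ordering.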
Proposal
Change Record.created_at from NanosSinceEpoch to UniqueTimestamp (HLC) so the monotonic ordering guarantee lives at the Bifrost layer. This would:
- Eliminate information loss: The state machine would receive a proper UniqueTimestamp directly, no reconstruction needed
- Benefit all consumers: Any future consumer of Record ordering gets correct monotonic timestamps for free
- Leverage existing infrastructure: The on-disk record format already supports HLC timestamps via RecordFlags::HlcTimestamp, and both log-server and local-loglet decoders handle it
- Remove the persisted HLC workaround: The state machine's HLC counter and its FSM table persistence (added in #4515) can be removed
Design considerations
Sequencer-assigned timestamps
Ideally, the created_at HLC timestamp should be assigned by the sequencer rather than by the record producer. Currently, NanosSinceEpoch::now() is called at the producer side (InputRecord::from(Arc<T>)), meaning different nodes can stamp records with their own wall clocks before sending to the sequencer. This causes timestamp skew — records from nodes with slightly different clocks can arrive at the sequencer in a different order than their timestamps suggest.
If the sequencer assigns the HLC timestamp, it guarantees:
- No cross-node clock skew: A single clock source determines the ordering
- Monotonicity aligned with LSN: The sequencer already assigns LSNs, so aligning HLC assignment with LSN assignment ensures the two ordering signals are consistent
- Simpler producer API: Producers don't need access to an HLC clock
The tradeoff is that created_at would reflect sequencer time rather than producer time, which may slightly affect latency metrics (write-to-read latency). This is likely acceptable since the sequencer is on the critical path anyway.
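The idea of aligning HLC assignment with LSN assignment can be sketched as follows. Everything here is hypothetical (Sequencer, Stamped, append_at, and the UniqueTs stand-in are illustrative names, not Bifrost APIs); the sketch only shows a single component handing out both ordering signals in the same step.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical stand-in for an HLC timestamp (physical part, then logical).
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct UniqueTs {
    physical_millis: u64,
    logical: u32,
}

/// A record after the (hypothetical) sequencer has stamped it.
#[derive(Debug)]
struct Stamped<T> {
    lsn: u64,
    created_at: UniqueTs,
    payload: T,
}

/// Hypothetical sequencer that owns both the LSN counter and the HLC state.
struct Sequencer {
    next_lsn: u64,
    last_ts: UniqueTs,
}

impl Sequencer {
    fn new() -> Self {
        Sequencer {
            next_lsn: 1,
            last_ts: UniqueTs { physical_millis: 0, logical: 0 },
        }
    }

    /// Stamps a record given an explicit clock reading (useful for testing).
    fn append_at<T>(&mut self, payload: T, now_millis: u64) -> Stamped<T> {
        let ts = if now_millis > self.last_ts.physical_millis {
            UniqueTs { physical_millis: now_millis, logical: 0 }
        } else {
            // Clock did not advance (or went backwards): bump the logical part.
            UniqueTs {
                physical_millis: self.last_ts.physical_millis,
                logical: self.last_ts.logical + 1,
            }
        };
        self.last_ts = ts;
        let lsn = self.next_lsn;
        self.next_lsn += 1;
        Stamped { lsn, created_at: ts, payload }
    }

    /// Stamps a record using the sequencer's own wall clock.
    fn append<T>(&mut self, payload: T) -> Stamped<T> {
        let now = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_millis() as u64)
            .unwrap_or(0);
        self.append_at(payload, now)
    }
}
```

Since the timestamp and the LSN are produced in the same call, a higher LSN always carries a higher created_at, regardless of which producer sent the record or what its wall clock said.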
Key changes needed
- Change Record.created_at field type from NanosSinceEpoch to UniqueTimestamp
- Add an HLC clock to the sequencer (preferred) or Bifrost append path
- Update InputRecord creation to either omit created_at (sequencer fills it) or use a provisional value
- Set RecordFlags::HlcTimestamp in on-disk encoders (infrastructure already exists)
- Handle bilrost wire format compatibility (rolling upgrade)
- Update latency metrics from nanosecond to millisecond precision (acceptable)
- The state machine's record_created_at could become UniqueTimestamp directly
- Remove the persisted HLC workaround in the state machine (FSM table field LAST_RECORD_UNIQUE_TS, StateMachine.last_record_unique_ts)
References
- Root cause analysis: Integration test CallOrdering => ordering(boolean[], Client) fails with vqueues #4515
- RecordFlags::HlcTimestamp already exists in log-server/src/rocksdb_logstore/record_format.rs and bifrost/src/providers/local_loglet/record_format.rs
- LogsHlcClock exists in types/src/logs/builder.rs (used for metadata, not records, but the type can be reused)