Skip to content

Commit d80178e

Browse files
committed
storage: use minimum of broker_timestamp and max_ts for retention_ms
The motivating case for `broker_time_based_retention` was the fact that records with bad timestamps produced in the future could lead to time-based retention being stuck indefinitely [1]. However, using _only_ the `broker_ts` can lead to unexpected behavior when e.g. replicating data from an existing cluster using MM2, as the timestamps of the Kafka records themselves are correctly preserved, but internally, `redpanda` data structures are not. To avoid the potentially curious behavior of a divergence in retention enforcement, take the minimum of the record's timestamp as written and the current broker time. This will achieve the original goal of preventing future timestamps from blocking retention enforcement, while also avoiding any unexpected behavior with past record timestamps. We also need to deal with the case where a client may have left a batch's `max_timestamp` unset, in which case it is marked with `{-1}` [2]. Clamp any non-positive values for `max_ts` to `broker_ts` in this case. [1]: * #9820 * #12991 [2]: * #25200
1 parent ff26805 commit d80178e

3 files changed

Lines changed: 104 additions & 40 deletions

File tree

src/v/storage/segment_index.h

Lines changed: 59 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -36,39 +36,51 @@ using broker_timestamp_t = ss::lowres_system_clock::time_point;
3636

3737
// clang-format off
3838
// this truth table shows which timestamp gets used as retention_timestamp
39+
// `use_min_broker_batch_ts` is only relevant if `use_broker_ts` is also `TRUE`.
3940
//
40-
// use_broker_ts has_broker_ts ignore_future_ts has_alternative_to_future_ts retention_ts
41+
// use_broker_ts has_broker_ts ignore_future_ts has_alternative_to_future_ts use_min_broker_batch_ts retention_ts
4142
// (likely false)
42-
// TRUE TRUE TRUE TRUE broker_timestamp
43-
// TRUE TRUE TRUE FALSE broker_timestamp
44-
// TRUE TRUE FALSE TRUE broker_timestamp
45-
// TRUE TRUE FALSE FALSE broker_timestamp // new segment, new cluster
46-
// TRUE FALSE TRUE TRUE segment_index::_retention_ms // buggy old segment, new cluster
47-
// TRUE FALSE TRUE FALSE max_timestamp
48-
// TRUE FALSE FALSE TRUE max_timestamp
49-
// TRUE FALSE FALSE FALSE max_timestamp // old segment, new cluster
50-
// FALSE TRUE TRUE TRUE segment_index::_retention_ms
51-
// FALSE TRUE TRUE FALSE max_timestamp
52-
// FALSE TRUE FALSE TRUE max_timestamp
53-
// FALSE TRUE FALSE FALSE max_timestamp // new segment, upgraded cluster
54-
// FALSE FALSE TRUE TRUE segment_index::_retention_ms // buggy old segments
55-
// FALSE FALSE TRUE FALSE max_timestamp
56-
// FALSE FALSE FALSE TRUE max_timestamp
57-
// FALSE FALSE FALSE FALSE max_timestamp // old segment, upgraded cluster
43+
// TRUE TRUE TRUE TRUE TRUE min(max_timestamp, broker_timestamp)
44+
// TRUE TRUE TRUE FALSE TRUE min(max_timestamp, broker_timestamp)
45+
// TRUE TRUE FALSE TRUE TRUE min(max_timestamp, broker_timestamp)
46+
// TRUE TRUE FALSE FALSE TRUE min(max_timestamp, broker_timestamp) // new segment, new cluster
47+
// TRUE FALSE TRUE TRUE TRUE segment_index::_retention_ms // buggy old segment, new cluster
48+
// TRUE FALSE TRUE FALSE TRUE max_timestamp
49+
// TRUE FALSE FALSE TRUE TRUE max_timestamp
50+
// TRUE FALSE FALSE FALSE TRUE max_timestamp // old segment, new cluster
51+
// TRUE TRUE TRUE TRUE FALSE broker_timestamp
52+
// TRUE TRUE TRUE FALSE FALSE broker_timestamp
53+
// TRUE TRUE FALSE TRUE FALSE broker_timestamp
54+
// TRUE TRUE FALSE FALSE FALSE broker_timestamp // new segment, new cluster
55+
// TRUE FALSE TRUE TRUE FALSE segment_index::_retention_ms // buggy old segment, new cluster
56+
// TRUE FALSE TRUE FALSE FALSE max_timestamp
57+
// TRUE FALSE FALSE TRUE FALSE max_timestamp
58+
// TRUE FALSE FALSE FALSE FALSE max_timestamp // old segment, new cluster
59+
// FALSE TRUE TRUE TRUE ----- segment_index::_retention_ms
60+
// FALSE TRUE TRUE FALSE ----- max_timestamp
61+
// FALSE TRUE FALSE TRUE ----- max_timestamp
62+
// FALSE TRUE FALSE FALSE ----- max_timestamp // new segment, upgraded cluster
63+
// FALSE FALSE TRUE TRUE ----- segment_index::_retention_ms // buggy old segments
64+
// FALSE FALSE TRUE FALSE ----- max_timestamp
65+
// FALSE FALSE FALSE TRUE ----- max_timestamp
66+
// FALSE FALSE FALSE FALSE ----- max_timestamp // old segment, upgraded cluster
5867
// clang-format on
5968

6069
// this struct is meant to be a local copy of the feature
6170
// broker_time_based_retention and configuration property
6271
// storage_ignore_timestamps_in_future_secs
6372
struct time_based_retention_cfg {
6473
bool use_broker_time;
74+
bool use_min_broker_batch_time;
6575
bool use_escape_hatch_for_timestamps_in_the_future;
6676

6777
static auto make(const features::feature_table& ft)
6878
-> time_based_retention_cfg {
6979
return {
7080
.use_broker_time = ft.is_active(
7181
features::feature::broker_time_based_retention),
82+
.use_min_broker_batch_time = ft.is_active(
83+
features::feature::min_broker_batch_time_based_retention),
7284
.use_escape_hatch_for_timestamps_in_the_future
7385
= config::shard_local_cfg()
7486
.storage_ignore_timestamps_in_future_sec()
@@ -83,7 +95,36 @@ struct time_based_retention_cfg {
8395
std::optional<model::timestamp> alternative_retention_ts) const noexcept {
8496
// new clusters and new segments should hit this branch
8597
if (likely(use_broker_time && broker_ts.has_value())) {
86-
return *broker_ts;
98+
auto ts = broker_ts.value();
99+
if (use_min_broker_batch_time) {
100+
// Some clients leave `max_timestamp` within a batch empty,
101+
// leaving it marked as {-1} (See:
102+
// https://github.com/redpanda-data/redpanda/pull/25200). Only
103+
// consider positive timestamps as reasonable alternatives to
104+
// `broker_ts`.
105+
if (max_ts > model::timestamp{0}) {
106+
// The motivating case for `broker_time_based_retention` was
107+
// the fact that records with bad timestamps produced in the
108+
// future could lead to time-based retention being stuck
109+
// indefinitely (See:
110+
// https://github.com/redpanda-data/redpanda/issues/9820,
111+
// https://github.com/redpanda-data/redpanda/pull/12991).
112+
// However, using _only_ the `broker_ts` can lead to
113+
// unexpected behavior when e.g. replicating data from an
114+
// existing cluster using MM2, as the timestamps of the
115+
// Kafka records themselves are correctly preserved, but
116+
// internally, `redpanda` data structures are not. To avoid
117+
// the potentially curious behavior of a divergence in
118+
// retention enforcement, take the minimum of the record's
119+
// timestamp as written and the current broker time. This
120+
// will achieve the original goal of preventing future
121+
// timestamps from blocking retention enforcement, while
122+
// also avoiding any unexpected behavior with past record
123+
// timestamps.
124+
ts = std::min(max_ts, ts);
125+
}
126+
}
127+
return ts;
87128
}
88129
// don't use broker time or no broker time available. fallback
89130
if (unlikely(

src/v/storage/tests/storage_e2e_test.cc

Lines changed: 1 addition & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -695,29 +695,8 @@ TEST_F(storage_test_fixture, test_time_based_eviction) {
695695
as);
696696
};
697697

698-
// gc with timestamp -1s, no segments should be evicted
698+
// gc with timestamp -1s, all but the active segment should be evicted
699699
compact_and_prefix_truncate(*disk_log, make_compaction_cfg(broker_t0 - 2s));
700-
ASSERT_EQ(disk_log->segments().size(), 3);
701-
ASSERT_EQ(
702-
disk_log->segments().front()->offsets().get_base_offset(),
703-
model::offset(0));
704-
ASSERT_EQ(
705-
disk_log->segments().back()->offsets().get_dirty_offset(),
706-
model::offset(59));
707-
708-
// gc with timestamp +sep/2, should evict first segment
709-
compact_and_prefix_truncate(
710-
*disk_log, make_compaction_cfg(broker_t0 + (broker_ts_sep / 2)));
711-
ASSERT_EQ(disk_log->segments().size(), 2);
712-
ASSERT_EQ(
713-
disk_log->segments().front()->offsets().get_base_offset(),
714-
model::offset(10));
715-
ASSERT_EQ(
716-
disk_log->segments().back()->offsets().get_dirty_offset(),
717-
model::offset(59));
718-
// gc with timestamp +sep3/2, should evict another segment
719-
compact_and_prefix_truncate(
720-
*disk_log, make_compaction_cfg(broker_t0 + (3 * broker_ts_sep / 2)));
721700
ASSERT_EQ(disk_log->segments().size(), 1);
722701
ASSERT_EQ(
723702
disk_log->segments().front()->offsets().get_base_offset(),

tests/rptest/tests/retention_policy_test.py

Lines changed: 44 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -972,3 +972,47 @@ def prefix_truncated():
972972
# Segments should be cleaned up now that we've switched on force-correction
973973
# of timestamps in the future
974974
self.redpanda.wait_until(prefix_truncated, timeout_sec=30, backoff_sec=1)
975+
976+
@cluster(num_nodes=2)
977+
def test_past_timestamps(self):
978+
"""
979+
While future record timestamps should be adjusted back to the broker
980+
timestamp to avoid blocking retention enforcement, record timestamps in the past
981+
should still be respected.
982+
"""
983+
984+
# Set `retention.ms` to 23 hours
985+
retention_ms = 23 * 3600 * 1000
986+
self.client().alter_topic_config(self.topic, "retention.ms", retention_ms)
987+
988+
# An artificial timestamp base in milliseconds (one day in the past)
989+
past_timestamp = (int(time.time()) - 24 * 3600) * 1000
990+
991+
# Produce a run of messages with CreateTime-style timestamps, each
992+
# record having a timestamp 1ms greater than the last.
993+
msg_size = 14000
994+
segments_count = 10
995+
msg_count = (self.segment_size // msg_size) * segments_count
996+
997+
# Write msg_count messages with timestamps in the past
998+
producer = KgoVerifierProducer(
999+
context=self.test_context,
1000+
redpanda=self.redpanda,
1001+
topic=self.topic,
1002+
msg_size=msg_size,
1003+
msg_count=(self.segment_size // msg_size) * segments_count,
1004+
fake_timestamp_ms=past_timestamp,
1005+
batch_max_bytes=msg_size * 2,
1006+
)
1007+
producer.start()
1008+
producer.wait()
1009+
1010+
def prefix_truncated():
1011+
segs = self.redpanda.node_storage(self.redpanda.nodes[0]).segments(
1012+
"kafka", self.topic, 0
1013+
)
1014+
self.logger.debug(f"Segments: {segs}")
1015+
return len(segs) <= 1
1016+
1017+
# Expect to see prefix truncation of day-old records with `retention.ms=23h`.
1018+
self.redpanda.wait_until(prefix_truncated, timeout_sec=30, backoff_sec=1)

0 commit comments

Comments
 (0)