storage: use max_ts for retention_ms #27383
Conversation
bharathv left a comment:
Fix makes sense to me, thanks for the PR.
Would prefer a 👍 from one of the storage experts in case I missed something.
I like what we've done here, but we should cross-reference https://cwiki.apache.org/confluence/display/KAFKA/KIP-937%3A+Improve+Message+Timestamp+Validation, which was introduced in Kafka 3.6.0. Right now, as far as I'm aware, Redpanda does not reject messages based on timestamp validation, which I'm probably happy with, but we could consider rejecting messages whose timestamps are more than some threshold in the future.
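For intuition, a KIP-937-style bound check amounts to something like the following. This is a minimal sketch, not Redpanda's or Kafka's actual code; the function name and parameters are illustrative:

```cpp
#include <cstdint>

// Sketch of a KIP-937-style check: accept a record only if its
// client-supplied timestamp lies within
// [now - before_max_ms, now + after_max_ms], mirroring the Kafka
// properties message.timestamp.before.max.ms and
// message.timestamp.after.max.ms. Names here are illustrative.
bool timestamp_within_bounds(
  int64_t record_ts_ms,
  int64_t broker_now_ms,
  int64_t before_max_ms,
  int64_t after_max_ms) {
    return (broker_now_ms - record_ts_ms) <= before_max_ms
           && (record_ts_ms - broker_now_ms) <= after_max_ms;
}
```

With such a check on the produce path, a record more than `after_max_ms` in the future would be rejected outright rather than silently wedging time-based retention.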
Yes, I think implementing
I'm wondering if we need some sort of config to maintain the old behavior on old clusters and default to the updated behavior on new clusters.
We already have some configs that should honestly be cleaned up. If we want to invest in configs, we should just support the Kafka properties IMO.
From talking with @michael-redpanda, it would be nice to have another feature flag (grumble grumble) for this change that is on by default for new clusters and opt-in for old ones.

We could also feature flag enforcement of
```cpp
release_version::v25_3_1,
"min_broker_batch_time_based_retention",
feature::min_broker_batch_time_based_retention,
feature_spec::available_policy::new_clusters_only,
```
Do we really want this for new clusters only? Or do we want it for everyone, with the option for folks to turn it off if it causes issues? I personally see it as: you really always want this behavior, and we should eventually remove this feature flag.
I guess I can live with this for new clusters only but I don't see why we want to keep the old behavior as opt-in.
> Do we really want this for new clusters only?
I don't have any clear idea of how many clusters have old batch timestamps, whether that's due to badly produced messages or by design. If we did have clusters like that, though, they would see a lot of data vanish quickly without this feature being gated to new clusters only.
> I personally see it as you really always want this behavior
Then we wouldn't need a feature flag at all, which would be nice. I'm open to this but I think we need buy-in from other members as well.
cc: @dotnwat
> but I don't see why we want to keep the old behavior as opt-in
`broker_time_based_retention` should be retired or subsumed by this new behavior, I agree.
Not for this PR, and maybe the feature table isn't the right place to do this, but now that #27419 has landed, it'd be nice if we could tell new clusters to trust batch timestamps (because they're validated on the produce path) and stop having to track broker timestamps (which isn't consistent anyway, since tiered storage doesn't honor the broker timestamp for retention).
@tmgstevens this was merged! 👍 #27419

Needs a rebase to fix conflicts.
Thanks, rebased and pushed.
Oops. There's more to deprecate here (mostly in the storage layer). I will come back to this soon. |
```diff
 @cluster(num_nodes=2)
-@matrix(mixed_timestamps=[True, False], use_broker_timestamps=[True, False])
-def test_bogus_timestamps(
-    self, mixed_timestamps: bool, use_broker_timestamps: bool
-):
+def test_future_timestamps(self):
     """
-    :param mixed_timestamps: if true, test with a mixture of valid and invalid
-    timestamps in the same segment (i.e. timestamp adjustment should use the
-    valid timestamps rather than falling back to mtime)
+    Ensure record timestamps in the future are respected by retention enforcement.
     """
```

```python
# broker_time_based_retention fixes this test case for new segments.
# (disable it to simulate a legacy condition)
```
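The behavior this test exercises can be modeled in a few lines. This is an illustrative sketch only, not the actual storage code: with validated timestamps, a segment whose `max_timestamp` lies in the future is simply not past retention until the wall clock catches up to it.

```cpp
#include <cstdint>

// Illustrative model (not disk_log_impl.cc): a segment is eligible for
// time-based deletion once its retention timestamp is older than
// retention_ms. A max_timestamp in the future keeps the segment alive
// until the clock passes it, instead of falling back to broker time.
bool past_retention(
  int64_t segment_max_ts_ms, int64_t now_ms, int64_t retention_ms) {
    return now_ms - segment_max_ts_ms > retention_ms;
}
```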
WDYT about keeping this test around as an upgrade test? That is, start on v25.2, and then upgrade to v25.3 -- the workings of this test from before wouldn't change, even after the upgrade.

Yeah, sure, why not. Will push soon-ish.
rockwotj left a comment:
LGTM, some nits on the comments
`src/v/storage/disk_log_impl.cc` (outdated):

```cpp
  s->index().broker_timestamp(),
  s);
if (!validated_timestamps) {
    // Deprecated code path for >= v25.3.1.
```
This is confusing to me. This code path is deprecated, but we take it for versions less than our next one.

Suggested change:

```diff
-// Deprecated code path for >= v25.3.1.
+// This legacy behavior is only needed in older clusters that didn't validate timestamps.
```
`src/v/storage/disk_log_impl.cc` (outdated):

```cpp
auto validated_timestamps = _feature_table.local().is_active(
  features::feature::validated_batch_timestamps);
if (validated_timestamps) {
    // Deprecated code path for >= v25.3.1.
```
Well, the return path isn't deprecated, right? Just the else?

Altered comment.
Pull Request Overview
This PR modifies the storage layer's time-based retention to use max_timestamp instead of broker_timestamp when the validated_batch_timestamps feature flag is active. The change addresses unexpected behavior when using broker_time_based_retention with data from external sources (e.g., Mirror Maker 2) where original record timestamps should be preserved for retention enforcement.
- Introduces a new feature flag `validated_batch_timestamps` that controls whether to use validated max timestamps for retention
- Updates time-based retention logic to prioritize `max_timestamp` when timestamp validation is enabled
- Adds comprehensive tests to verify retention behavior with both past and future timestamps
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| tests/rptest/tests/retention_policy_test.py | Adds new test class TimestampDriftRetentionTest with tests for future/past timestamp retention behavior, refactors existing BogusTimestampTest to use versioned upgrades |
| src/v/storage/segment_index.h | Updates truth table and logic to prioritize max_timestamp when validated_batch_timestamps is active, adds new configuration field |
| src/v/storage/disk_log_impl.cc | Modifies GC logic to skip bogus timestamp warnings and retention timestamp adjustments when using validated timestamps |
| src/v/features/feature_table.h | Defines new validated_batch_timestamps feature flag with new_clusters_only availability policy |
| src/v/features/feature_table.cc | Adds string representation for the new feature flag |
| src/v/storage/tests/*.cc | Updates test fixtures to provide explicit timestamp parameters and handle new retention behavior |
| src/v/cluster/archival/tests/*.cc | Updates archival tests to provide default timestamps for segment descriptors |
```diff
 log->housekeeping(ccfg).get();

-auto offset = last_evicted_offset.get_future().get();
+std::ignore = last_evicted_offset.get_future().get();
```
[nitpick] Using `std::ignore` suggests this return value was previously used but is now discarded. Consider removing the assignment entirely or adding a comment explaining why the future result is intentionally ignored.

Suggested change:

```diff
-std::ignore = last_evicted_offset.get_future().get();
+last_evicted_offset.get_future().get();
```
@rockwotj feel free to force merge, the latest force push is just correcting comments.
To be used in a future commit for adjusting time-based retention enforcement behavior in local storage. The behavior will only be enabled for new clusters created `>= v25.3.1`, but can be opted into via the admin API.

`validated_batch_timestamps` ensures that the `max_timestamp` of a batch produced to `redpanda` has been validated by the properties `message.timestamp.{before/after}.max.ms` and unconditionally set in the produce path (see: `v/kafka/server/handlers/produce_validation.cc`), and is therefore valid for retention enforcement.
The motivating case for `broker_time_based_retention` was the fact that records with bad timestamps produced in the future could lead to time-based retention being stuck indefinitely [1].

However, using the `broker_ts` can lead to unexpected behavior when e.g. replicating data from an existing cluster using MM2, as the timestamps of the Kafka records themselves are correctly preserved, but internally, `redpanda` data structures are not.

To avoid the potentially curious behavior of a divergence in retention enforcement, go back to using the `max_timestamp` for batches whose timestamps have been validated and unconditionally set in the produce path (see: `v/kafka/server/handlers/produce_validation.cc`) as of `v25.3.1`.

[1]:
* redpanda-data#9820
* redpanda-data#12991
This stuff is all deprecated since we have timestamp validation in the produce path, and is actually quite undesirable behavior now. The cluster property `storage_ignore_timestamps_in_future_sec` should eventually be deprecated after enough time has passed (at the same time, we should change the `available_policy` of `validated_batch_timestamps` from `new_clusters_only` to `always`).
Force push to rebase.
This test intended to disable local retention, but did not consider the property `retention.local.target.ms` for a tiered-storage-enabled topic. Previously, local retention would be enforced by broker time, but after PR redpanda-data#27383, the test became flaky due to the now "active" retention enforcement.

Set `retention.local.target.ms=-1` to disable time-based retention for tiered-storage-enabled topics in this test.
The motivating case for `broker_time_based_retention` was the fact that records with bad timestamps produced in the future could lead to time-based retention being stuck indefinitely [1].

However, using the `broker_ts` can lead to unexpected behavior when e.g. replicating data from an existing cluster using MM2, as the timestamps of the Kafka records themselves are correctly preserved, but internally, `redpanda` data structures are not.

To avoid the potentially curious behavior of a divergence in retention enforcement, go back to using the `max_timestamp` for batches whose timestamps have been validated and unconditionally set in the produce path (see: `v/kafka/server/handlers/produce_validation.cc`) as of `v25.3.1`.

All of the changes described here are feature gated by the flag `min_broker_batch_time_based_retention`. This flag will only be activated by default for new clusters created with minimum version `v25.3.1`. Older clusters can still opt into its behavior via the admin API, e.g.

[1]:
* redpanda-data#9820
* redpanda-data#12991
Backports Required
Release Notes
Improvements
`broker_time_based_retention` is enabled for enforcement of time-based local retention in the `storage` layer.