time based retention with broker time #12991

Merged
andijcr merged 20 commits into redpanda-data:dev from andijcr:feat/index_state_broker_time
Sep 28, 2023
Conversation

@andijcr
Contributor

@andijcr andijcr commented Aug 24, 2023

This PR enables time-based retention to operate on broker time.

Previously, max_timestamp was used for time-based retention.

When a new batch is indexed, the current timestamp is recorded in index_state and used when computing retention_ts of a segment.

For backward compatibility, the field is optional, and the old mechanism of storage_ignore_timestamps_in_future_secs + max_timestamp is retained.

The broker_timestamp is always recorded for new segments, but its use is gated behind the feature broker_time_based_retention. Since this PR changes the behavior for some legitimate workloads that use timestamps in the past, the feature is enabled only for new clusters. For an upgraded cluster, it needs to be activated via the admin API.

In some cases, when the index is lost and needs to be regenerated, the broker_timestamp is not set. A future PR will explore the impact of recording the broker timestamp in model::record_batch_header directly, to cover the case of re-indexing.
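The selection logic described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual compute_retention_ms from the PR: the types, field names, and parameters here are simplified assumptions (the real code takes the feature table and configuration as inputs).

```cpp
#include <cstdint>
#include <optional>

// Simplified stand-in for model::timestamp (milliseconds since the Unix epoch).
using timestamp_ms = int64_t;

// Hypothetical, simplified view of the index_state fields involved.
struct segment_index_state {
    timestamp_ms max_timestamp;                   // client-supplied, may lie
    std::optional<timestamp_ms> broker_timestamp; // recorded at index time; absent for old segments
};

// Prefer the broker timestamp when the feature is active and the field was
// recorded; otherwise fall back to the old mechanism, where a max_timestamp
// too far in the future (beyond storage_ignore_timestamps_in_future_secs)
// is ignored and clamped to "now".
timestamp_ms retention_ts(
  const segment_index_state& st,
  bool broker_time_based_retention_active,
  timestamp_ms now,
  timestamp_ms ignore_in_future_ms) {
    if (broker_time_based_retention_active && st.broker_timestamp) {
        return *st.broker_timestamp;
    }
    if (st.max_timestamp > now + ignore_in_future_ms) {
        return now;
    }
    return st.max_timestamp;
}
```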

Fixes #12934 #12992

Unrelated to the main PR, but fixes #13617

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.2.x
  • v23.1.x
  • v22.3.x

Release Notes

Features

  • Time-based retention uses the broker timestamp for new data. This reduces the risk of retention not removing segments when a misbehaving client produces messages with a timestamp in the future.
    • The configuration option storage_ignore_timestamps_in_future_secs is retained to deal with bad segments produced before v23.3.
    • This changes the behavior for messages with a timestamp in the past. Before, retention would use this timestamp to delete data; now, the retention window starts when the message arrives at the broker.

@andijcr andijcr force-pushed the feat/index_state_broker_time branch 3 times, most recently from 15051ae to e1d4a27 Compare August 30, 2023 12:21
@andijcr andijcr marked this pull request as ready for review August 30, 2023 13:07
@andijcr
Contributor Author

andijcr commented Aug 30, 2023

issue is #12839

@andijcr
Contributor Author

andijcr commented Aug 31, 2023

/ci-repeat 1

1 similar comment
@andijcr
Contributor Author

andijcr commented Aug 31, 2023

/ci-repeat 1

@andijcr andijcr force-pushed the feat/index_state_broker_time branch 2 times, most recently from 87f907b to 4850223 Compare September 5, 2023 16:50
@andijcr
Contributor Author

andijcr commented Sep 5, 2023

force push: addressed review comments, fixed a missing broker_timestamp after self-compaction

}

/// this function is the codification of the table above
constexpr auto compute_retention_ms(
Contributor

I'm curious how this can be constexpr if everything it takes as inputs are not known at compile time

Contributor Author

a force of habit when I write a method in the header. In the first version, it was truly constexpr but then I wired in feature_table and configuration so now it's extremely unlikely that it will ever be constexpr. I'll change it.
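For context on the thread above, a minimal illustration (not project code): constexpr only permits compile-time evaluation when every argument is a constant expression; the same function remains callable at run time, which is why the qualifier becomes meaningless once the function reads runtime state such as a feature table.

```cpp
#include <cassert>

// constexpr *permits* compile-time evaluation; it does not require it.
constexpr int twice(int x) { return x * 2; }

// Evaluated at compile time, since the argument is a constant expression.
static_assert(twice(21) == 42, "compile-time evaluation");
```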


namespace storage {

// clang-format off
Contributor

this comment is amazing. should this be surfaced in docs? cc: @Feediver1

@andijcr andijcr requested a review from andrwng September 12, 2023 09:04
// this struct is meant to be a local copy of the feature
// broker_time_based_retention and configuration property
// storage_ignore_timestamps_in_future_secs
struct time_based_retention_cfg {
Contributor

Do we need to follow the same logic in the cloud storage?

Contributor Author

The last discussion concluded it was too high risk: it would require a new field in the cstore and a change to the retention behaviour for cloud (so there should be some orchestration during an upgrade). Applying storage_ignore_timestamps_in_future_secs would also require some non-zero effort.

Part of the follow-up is to document cloud vs local retention, to have a discussion about this.

Lazin
Lazin previously approved these changes Sep 12, 2023
Contributor

@Lazin Lazin left a comment

LGTM: curious about the cloud storage impact, also, that model::timestamp::now should probably be replaced with seastar::lowres_system_clock::now + conversion to model::timestamp.

@andijcr
Contributor Author

andijcr commented Sep 12, 2023

LGTM: curious about the cloud storage impact, also, that model::timestamp::now should probably be replaced with seastar::lowres_system_clock::now + conversion to model::timestamp.

adding a commit for this + static assert to ensure that the underlying datatype is stable
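A sketch of what that change looks like, under assumptions: the real code uses seastar::lowres_system_clock and model::timestamp, which are stood in for here by std::chrono::system_clock and a hypothetical simplified struct.

```cpp
#include <chrono>
#include <cstdint>
#include <type_traits>

// Simplified stand-in for model::timestamp: milliseconds since the Unix epoch.
struct timestamp {
    int64_t value;
};

// Take wall-clock time from a system clock (seastar::lowres_system_clock in
// the real code) and convert it to the timestamp representation.
inline timestamp broker_now() {
    using namespace std::chrono;
    int64_t ms = duration_cast<milliseconds>(
      system_clock::now().time_since_epoch()).count();
    return timestamp{ms};
}

// Guard against the underlying datatype changing silently, in the spirit of
// the static_assert mentioned above.
static_assert(std::is_same_v<decltype(timestamp::value), int64_t>);
```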

@andijcr
Contributor Author

andijcr commented Sep 15, 2023

force push to fix merge conflicts.

Since I had to force push anyway, I added a last commit to use ss::lowres_system_clock
f8e6467

@Lazin
@StephanDollberg

@andijcr andijcr requested a review from Lazin September 15, 2023 09:49
@andijcr andijcr force-pushed the feat/index_state_broker_time branch from 6e384a8 to e6443ee Compare September 28, 2023 10:30
@andijcr
Contributor Author

andijcr commented Sep 28, 2023

Last rebase was to fix the merge conflict for a spot in feature_table. CI is green.

@andijcr andijcr merged commit 0a0a5d5 into redpanda-data:dev Sep 28, 2023
@andijcr andijcr deleted the feat/index_state_broker_time branch September 28, 2023 15:37
WillemKauf added a commit to WillemKauf/redpanda that referenced this pull request Aug 26, 2025
…ion_ms`

The motivating case for `broker_time_based_retention` was the
fact that records with bad timestamps produced in the future
could lead to time-based retention being stuck indefinitely [1].

However, using _only_ the `broker_ts` can lead to unexpected
behavior when e.g. replicating data from an existing cluster
using MM2, as the timestamps of the Kafka records themselves are
correctly preserved, but internally, `redpanda` data structures
are not.

To avoid the potentially curious behavior of a divergence in
retention enforcement, take the minimum of the record's timestamp
as written and the current broker time. This will achieve the
original goal of preventing future timestamps from blocking
retention enforcement, while also avoiding any unexpected
behavior with past record timestamps.

[1]:
* redpanda-data#9820
* redpanda-data#12991
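The adjustment described in this commit message can be sketched as follows (names are illustrative assumptions, not the actual redpanda code): taking the minimum means a future client timestamp cannot block retention, while a genuine past timestamp, such as MM2-mirrored data, is still honored.

```cpp
#include <algorithm>
#include <cstdint>

using timestamp_ms = int64_t;

// Use min(record max_timestamp, broker time) as the retention timestamp.
timestamp_ms effective_retention_ts(timestamp_ms max_ts, timestamp_ms broker_ts) {
    return std::min(max_ts, broker_ts);
}
```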
WillemKauf added a commit to WillemKauf/redpanda that referenced this pull request Aug 26, 2025
…ion_ms`

The motivating case for `broker_time_based_retention` was the
fact that records with bad timestamps produced in the future
could lead to time-based retention being stuck indefinitely [1].

However, using _only_ the `broker_ts` can lead to unexpected
behavior when e.g. replicating data from an existing cluster
using MM2, as the timestamps of the Kafka records themselves are
correctly preserved, but internally, `redpanda` data structures
are not.

To avoid the potentially curious behavior of a divergence in
retention enforcement, take the minimum of the record's timestamp
as written and the current broker time. This will achieve the
original goal of preventing future timestamps from blocking
retention enforcement, while also avoiding any unexpected
behavior with past record timestamps.

We also need to deal with the case where a client may have left
a batch's `max_timestamp` unset, in which case it is marked with `{-1}` [2].
Clamp any non-positive values for `max_ts` to `broker_ts` in this case.

[1]:
* redpanda-data#9820
* redpanda-data#12991

[2]:
* redpanda-data#25200
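The additional clamp for the unset sentinel can be sketched like this (again an illustrative stand-in, not the actual redpanda code):

```cpp
#include <algorithm>
#include <cstdint>

using timestamp_ms = int64_t;

// A non-positive max_timestamp (the unset sentinel {-1}) is clamped to the
// broker time before the minimum is taken, so unset batches are retained
// relative to when they arrived rather than being garbage collected at once.
timestamp_ms effective_retention_ts(timestamp_ms max_ts, timestamp_ms broker_ts) {
    if (max_ts <= 0) {
        max_ts = broker_ts; // client left max_timestamp unset
    }
    return std::min(max_ts, broker_ts);
}
```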
WillemKauf added a commit to WillemKauf/redpanda that referenced this pull request Sep 17, 2025
The motivating case for `broker_time_based_retention` was the
fact that records with bad timestamps produced in the future
could lead to time-based retention being stuck indefinitely [1].

However, using the `broker_ts` can lead to unexpected
behavior when e.g. replicating data from an existing cluster
using MM2, as the timestamps of the Kafka records themselves are
correctly preserved, but internally, `redpanda` data structures
are not.

To avoid the potentially curious behavior of a divergence in
retention enforcement, go back to using the `max_timestamp`
for batches whose timestamps have been validated and unconditionally
set in the produce path (see: `v/kafka/protocol/kafka_batch_adapter.cc`)
as of `v25.3.1`.

[1]:
* redpanda-data#9820
* redpanda-data#12991
WillemKauf added a commit to WillemKauf/redpanda that referenced this pull request Sep 17, 2025
The motivating case for `broker_time_based_retention` was the
fact that records with bad timestamps produced in the future
could lead to time-based retention being stuck indefinitely [1].

However, using the `broker_ts` can lead to unexpected
behavior when e.g. replicating data from an existing cluster
using MM2, as the timestamps of the Kafka records themselves are
correctly preserved, but internally, `redpanda` data structures
are not.

To avoid the potentially curious behavior of a divergence in
retention enforcement, go back to using the `max_timestamp`
for batches whose timestamps have been validated and unconditionally
set in the produce path (see: `v/kafka/server/handlers/produce_validation.cc`)
as of `v25.3.1`.

[1]:
* redpanda-data#9820
* redpanda-data#12991
WillemKauf added a commit to WillemKauf/redpanda that referenced this pull request Sep 18, 2025
The motivating case for `broker_time_based_retention` was the
fact that records with bad timestamps produced in the future
could lead to time-based retention being stuck indefinitely [1].

However, using the `broker_ts` can lead to unexpected
behavior when e.g. replicating data from an existing cluster
using MM2, as the timestamps of the Kafka records themselves are
correctly preserved, but internally, `redpanda` data structures
are not.

To avoid the potentially curious behavior of a divergence in
retention enforcement, go back to using the `max_timestamp`
for batches whose timestamps have been validated and unconditionally
set in the produce path (see: `v/kafka/server/handlers/produce_validation.cc`)
as of `v25.3.1`.

Of course, users can opt out of safety from `max_timestamp=-1` with `legacy`
validation mode, in which case we may still want to use the broker time for
retention enforcement rather than garbage collecting everything based on
`max_timestamp=-1`.

[1]:
* redpanda-data#9820
* redpanda-data#12991

Development

Successfully merging this pull request may close these issues.

CI Failure (Exceptional future ignored) in cluster_cloud_metadata_rpfixture
record broker time in storage::index_state

9 participants