[bp/1.35] Support limiting stats number per scope #539

Merged
tedjpoole merged 2 commits into envoyproxy:release/v1.35 from tedjpoole:backport/stats-number-limit-1.35
May 21, 2026

Conversation

@tedjpoole
Contributor

kyessenov and others added 2 commits May 20, 2026 17:30
Change-Id: If3a45283b13cfda7d4f9a7bb661a1573f552ed7e
Commit Message: Introduce mark and sweep eviction of stale metrics in a
stats scope.

Additional Description: The intended use case is the high-cardinality
metrics generated from request data (e.g. [Istio standard
metrics](https://istio.io/latest/docs/reference/config/metrics/)).
Combined with cardinality bounds (a future PR), this would bound metric
resource usage. The algorithm works as follows:

1. An "evictable" scope is allocated by a filter.
2. A delta stats sink is configured, e.g. OTLP.
3. At every flush interval, a scope metric that has been used (e.g. has
observed a data point) is marked as unused. A metric that has not been
used since the previous flush is deleted from the central caches.
4. A notification is sent to all workers to purge the scope's stale
metrics from their thread-local caches.
5. Once all workers complete, the unused metrics are purged from the
allocator.
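
The steps above can be sketched as a small single-threaded model. This is only an illustration under stated assumptions, not the code in this PR: `Counter`, `WorkerCache`, and `EvictableScope` are hypothetical names, the worker notification in step 4 is modeled as a plain loop, and releasing the last `std::shared_ptr` stands in for the allocator purge in step 5.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical counter carrying a "used" mark; not Envoy's Stats::Counter.
struct Counter {
  uint64_t value = 0;
  bool used = false;
  void inc() { ++value; used = true; }
};

// Hypothetical stand-in for one worker's thread-local cache.
struct WorkerCache {
  std::unordered_map<std::string, std::shared_ptr<Counter>> counters;
};

// Hypothetical evictable scope: one central cache plus per-worker caches.
class EvictableScope {
public:
  explicit EvictableScope(std::size_t workers) : workers_(workers) {}

  // Resolve by name through the worker cache, falling back to the central cache.
  Counter& counter(std::size_t worker, const std::string& name) {
    assert(worker < workers_.size());
    auto& local = workers_[worker].counters;
    auto it = local.find(name);
    if (it == local.end()) {
      auto& slot = central_[name];
      if (!slot) slot = std::make_shared<Counter>();
      it = local.emplace(name, slot).first;
    }
    return *it->second;
  }

  // One flush interval. Returns the names evicted in this pass.
  std::vector<std::string> flush() {
    std::vector<std::string> evicted;
    for (auto it = central_.begin(); it != central_.end();) {
      if (it->second->used) {
        it->second->used = false; // Step 3: used metrics are marked as unused.
        ++it;
      } else {
        evicted.push_back(it->first);
        it = central_.erase(it); // Step 3: unused metrics leave the central cache.
      }
    }
    // Step 4: every worker drops its references to the stale metrics.
    for (auto& worker : workers_)
      for (const auto& name : evicted) worker.counters.erase(name);
    // Step 5: storage is freed once the last shared_ptr above is gone.
    return evicted;
  }

  std::size_t size() const { return central_.size(); }

private:
  std::unordered_map<std::string, std::shared_ptr<Counter>> central_;
  std::vector<WorkerCache> workers_;
};
```

In this model a metric survives a flush only if it was incremented since the previous flush, and a later lookup of an evicted name creates a fresh metric starting at zero.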

There are several edge conditions that need to be explained to validate
correctness of this algorithm:

1. A worker attempting to use a stale metric after (3) but before (4)
might have its data lost. It will not be lost if 1) the same metric is
recreated in the central cache by another worker since all metrics are
uniquely indexed in the allocators; or 2) we implement deferred
allocator deletions that wait for the flush operation.

2. A worker should not use a stored stale metric after (4). This
requires that workers do not store the metrics by reference (hence, this
solution will not work for most xDS metrics). Thread-local cache
references are always deleted before the storage is deleted.

3. Histograms are handled slightly differently because the parent
histogram needs to be "merged" to observe usage, and clearing the usage
requires updating all "children" histograms. Because we do this during
flush, merging is always done first.

4. A metric that is re-created after eviction would keep the start time
of the original metric. This is a limitation of Envoy
since it does not store the metric start times, but it is not an issue
with delta aggregation in OTLP. Delta is the recommended protocol for
handling high cardinality or sparse metric data. We could add start_time
in a follow-up.
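
The histogram ordering in condition 3 can be illustrated with a minimal sketch. `ChildHistogram` and `ParentHistogram` here are hypothetical types, not Envoy's histogram classes, and samples are kept in a plain vector as an assumption for brevity. The point is only the order: merge first so the parent observes usage, then clear the usage marks on the parent and every child.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-thread child histogram.
struct ChildHistogram {
  std::vector<uint64_t> pending; // Samples recorded since the last merge.
  bool used = false;
  void record(uint64_t v) { pending.push_back(v); used = true; }
};

// Hypothetical parent histogram that owns one child per thread.
class ParentHistogram {
public:
  explicit ParentHistogram(std::size_t threads) : children_(threads) {}

  ChildHistogram& child(std::size_t thread) {
    assert(thread < children_.size());
    return children_[thread];
  }

  // Merge children into the parent; this is how the parent observes usage.
  void merge() {
    for (auto& c : children_) {
      merged_.insert(merged_.end(), c.pending.begin(), c.pending.end());
      c.pending.clear();
      used_ = used_ || c.used;
    }
  }

  // Decide staleness, then clear usage on the parent and all children.
  bool sweep() {
    const bool stale = !used_;
    used_ = false;
    for (auto& c : children_) c.used = false;
    return stale;
  }

  // Flush order from the text: merging is always done first.
  bool flush() {
    merge();
    return sweep();
  }

  std::size_t sampleCount() const { return merged_.size(); }

private:
  std::vector<ChildHistogram> children_;
  std::vector<uint64_t> merged_;
  bool used_ = false;
};
```

Here `flush()` returns `true` when the histogram saw no samples in the interval, which is the case where an evictable scope would drop it.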

Risk Level: low, requires explicit usage
Testing: unit and a load test with Istio Proxy
Docs Changes: none
Release Notes: none

---------

Signed-off-by: Kuat Yessenov <kuat@google.com>
Signed-off-by: Ted Poole <tpoole@redhat.com>
Commit Message: Add support for limiting the maximum number of stats
(counter/gauge/histogram) per scope. This helps prevent the memory
explosion caused by high-cardinality stats.
Risk Level: low
Testing: unit test covered
Docs Changes: none, as this is library support for internal use
Release Notes: updated
Platform Specific Features: no
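
A per-scope limit of this kind might look like the following sketch for counters. The names (`LimitedScope`, `max_counters`, `rejected()`) are hypothetical, and the overflow behavior is an assumption: here, creation past the limit returns a shared sink counter that is never reported. The behavior of the actual Envoy change may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical counter; not Envoy's Stats::Counter.
struct Counter {
  uint64_t value = 0;
  void inc() { ++value; }
};

// Hypothetical scope that caps the number of distinct counters it creates.
class LimitedScope {
public:
  explicit LimitedScope(std::size_t max_counters) : max_counters_(max_counters) {}

  Counter& counter(const std::string& name) {
    auto it = counters_.find(name);
    if (it != counters_.end()) return *it->second; // Existing stats are always served.
    if (counters_.size() >= max_counters_) {
      ++rejected_;      // Assumption: track how many creations were refused.
      return overflow_; // Assumption: a shared sink absorbs writes past the limit.
    }
    auto result = counters_.emplace(name, std::make_unique<Counter>());
    assert(result.second); // The name was absent, so the insert must succeed.
    return *result.first->second;
  }

  std::size_t size() const { return counters_.size(); }
  std::size_t rejected() const { return rejected_; }

private:
  std::size_t max_counters_;
  std::size_t rejected_ = 0;
  Counter overflow_;
  std::unordered_map<std::string, std::unique_ptr<Counter>> counters_;
};
```

With a limit of 2, a third distinct name does not grow the scope, so memory stays bounded no matter how many label combinations the request data produces.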

---------

Signed-off-by: Xuyang Tao <taoxuy@google.com>
Signed-off-by: Ted Poole <tpoole@redhat.com>
@tedjpoole tedjpoole merged commit 530bb01 into envoyproxy:release/v1.35 May 21, 2026
2 of 3 checks passed
@tedjpoole tedjpoole deleted the backport/stats-number-limit-1.35 branch May 21, 2026 13:40