feat(metrics): add bound instruments behind experimental feature flag #3392
bryantbiggs wants to merge 20 commits into open-telemetry:main
Conversation
Codecov Report

@@            Coverage Diff            @@
##            main    #3392     +/-   ##
=======================================
+ Coverage   82.6%    82.7%    +0.1%
=======================================
  Files        128      128
  Lines      24645    25288     +643
=======================================
+ Hits       20360    20928     +568
- Misses      4285     4360      +75
c3a4782 to
d0d2b7f
Compare
    attrs,
    tracker.aggregator.clone_and_reset(&self.config),
));
tracker.evicted.store(true, Ordering::Release);
thinking out loud:
do we need to evict bound instruments after a delta collect()? Can we define the behavior as: bound instruments remain forever, unless the user explicitly calls unbind()?
If there are no new updates in a collect() cycle, then we need to not export anything, but the value map keeps the tracker.
Addressed in 936993f — replaced the swap-and-drain collect_and_reset with in-place delta collection that supports bound instrument persistence.
What changed:
TrackerEntry now has two status fields instead of evicted:
- has_been_updated: AtomicBool — set on every measure()/bound.add() call, checked and reset during collect. Only entries updated since the last collect are exported. No updates in a cycle = no export, but the tracker stays in the map.
- bound_count: AtomicUsize — incremented on bind(), decremented on Drop of the bound handle. Entries with bound_count > 0 are never evicted, even if they had no updates in a cycle.
Delta collect behavior:
- Iterate trackers under a read lock (no write lock needed in steady state)
- Export only entries where has_been_updated was true, then reset the flag
- Collect stale candidates: entries where has_been_updated = false AND bound_count = 0 AND not the overflow sentinel
- If stale candidates exist, acquire the write lock, re-check has_been_updated under the lock (TOCTOU guard against a concurrent measure()), then evict
Bound handles no longer check an evicted flag or have a fallback path — they always go direct to the aggregator. The Drop impl decrements bound_count so the entry becomes eligible for eviction after all handles are dropped.
The drain_and_reset path is preserved for async/Observable instruments (PrecomputedSum) that need true staleness detection via map clearing.
This design was informed by a previous experiment branch (feat/delta-collect-optimization, part of #2328) which validated the has_been_updated + in-place iteration pattern on EC2 benchmarks. The bound_count mechanism extends it to prevent eviction of bound entries.
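A minimal std-only sketch of the two status fields described above (the type and method names here are illustrative simplifications, not the SDK's real definitions; the `AtomicU64` stands in for the real aggregator):

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, AtomicUsize, Ordering};
use std::sync::Arc;

// Simplified stand-in for the TrackerEntry described in this thread.
struct TrackerEntry {
    value: AtomicU64,             // stand-in for the real aggregator
    has_been_updated: AtomicBool, // set on every measurement, reset at collect
    bound_count: AtomicUsize,     // live bound handles referencing this entry
}

impl TrackerEntry {
    fn measure(&self, v: u64) {
        self.value.fetch_add(v, Ordering::Relaxed);
        self.has_been_updated.store(true, Ordering::Release);
    }
}

// A bound handle holds the entry directly (no map lookup per call)
// and releases its reference count on Drop.
struct BoundHandle(Arc<TrackerEntry>);

impl BoundHandle {
    fn new(entry: Arc<TrackerEntry>) -> Self {
        entry.bound_count.fetch_add(1, Ordering::AcqRel);
        Self(entry)
    }
    fn add(&self, v: u64) {
        self.0.measure(v);
    }
}

impl Drop for BoundHandle {
    fn drop(&mut self) {
        self.0.bound_count.fetch_sub(1, Ordering::AcqRel);
    }
}

fn main() {
    let entry = Arc::new(TrackerEntry {
        value: AtomicU64::new(0),
        has_been_updated: AtomicBool::new(false),
        bound_count: AtomicUsize::new(0),
    });
    let h = BoundHandle::new(entry.clone());
    h.add(5);
    assert!(entry.has_been_updated.load(Ordering::Acquire));
    assert_eq!(entry.bound_count.load(Ordering::Acquire), 1);
    drop(h);
    // With bound_count back at 0, the entry becomes eligible for
    // eviction once it also goes a collect cycle without updates.
    assert_eq!(entry.bound_count.load(Ordering::Acquire), 0);
}
```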
self.count.fetch_add(1, Ordering::SeqCst);
new_tracker
} else {
    // Over cardinality limit — bind to the overflow tracker
this is tricky... when the user called bind(), we may have been over the limit and it got bound to overflow. But after a delta collect, unused points are reclaimed — yet the bound instrument still reports to overflow?
Or the delta collect sets evicted to true, and it falls back to the slow path?
We may have to see whether bound instruments should be exempt from the cardinality limit (that is not good and would violate the spec too). Open for ideas.
Addressed in 7b531c4 with an overflow fallback to the unbound path.
Solution: bind() returns Option, None at overflow
ValueMap::bind() now returns Option<Arc<TrackerEntry>> — None when at/over the cardinality limit. The bound handle types (BoundSumHandle, BoundHistogramHandle) use a Direct/Fallback enum:
- Direct: normal case — dedicated tracker, single-digit ns, no map lookup on each call
- Fallback: at overflow — delegates every .call() to the unbound Measure::call() path
This gives us:
- Correct overflow attribution — same behavior as unbound measurements at overflow; no data merging between distinct bound handles sharing an overflow tracker
- Automatic recovery — when delta collect evicts stale entries and frees cardinality space, measure.call() creates a proper entry for those attrs on the next invocation
- No bound_count inflation on the overflow tracker (this was a problem with the previous approach of binding directly to it)
Trade-off: a Fallback handle is no faster than unbound (full HashMap lookup each time). But if you're at the cardinality limit when calling bind(), you're already in degraded mode — the user still gets correct data, just not the fast-path benefit. The guidance remains: bind early during initialization before cardinality fills up.
Benchmarks confirm zero regression on the Direct hot path from the enum match:
| Benchmark | Before | After |
|---|---|---|
| Counter Bound | 1.81 ns | 1.83 ns |
| Histogram Bound | 6.57 ns | 6.57 ns |
| Counter Unbound | 51.40 ns | 51.75 ns |
| Histogram Unbound | 55.91 ns | 57.35 ns |
Added 4 new tests covering overflow + delta interaction scenarios.
Looks very promising! Left some comments in code, and responses to the open questions below. Will need to close some edge cases (cardinality cap, eviction, etc.). More test cases would also give higher confidence!
- Should BoundCounter/BoundHistogram auto-rebind after delta eviction, or is explicit bind() sufficient?
- Should bound instruments be behind a feature flag initially?
- Is there interest in bound UpDownCounter or Gauge support?
- Naming: bind() vs with_attributes() vs something else? bind() sounds good — this was the original name for this idea when it was part of the OTel spec in 2020, but it got removed for scope control.
use opentelemetry::{metrics::MeterProvider as _, KeyValue};
use opentelemetry_sdk::metrics::{ManualReader, SdkMeterProvider, Temporality};

fn create_provider(temporality: Temporality) -> SdkMeterProvider {
@bryantbiggs If you can include the bench results in code comment (similar to what we do in many other benches), that'll make it easy to get a quick feel of things. PR description has it, let's just copy it here. (do include hardware info if you are comfortable with that).
The numbers shared meet our expectations: single-digit ns for counters!
There is still contention, but solving that is a separate effort. (It's already solved in the OpenTelemetry Arrow project; we can steal it when the time comes.)
Added benchmark results as a comment block at the top of benches/bound_instruments.rs with Apple M4 Max hardware info and all benchmark numbers (Counter/Histogram unbound vs bound, plus multi-threaded bound counter at 2/4/8 threads).
Single-digit ns for bound counters as expected (~1.87 ns bound vs ~53 ns unbound = ~28x speedup). Histogram bound is ~6.6 ns vs ~59 ns unbound (~8.9x).
- Add explicit Send + Sync bounds on bind() return type for Rust 1.75
- Add TrackerMap type alias to satisfy clippy type_complexity
- Remove needless borrow in benchmark
Add integration tests for BoundCounter and BoundHistogram covering cumulative, delta, and mixed bound/unbound measurement paths. Add changelog entries to both opentelemetry and opentelemetry-sdk.
Benchmark Snapshot — Before Delta Collect Refactor
Baseline numbers on the current branch. Hardware: Apple M4 Max (MacBook Pro), macOS 15.4
Speedups: Counter 28.3x, Histogram 5.5x (bound vs unbound)
…rsistence

Replace swap-and-drain delta collection with in-place iteration using a has_been_updated flag on TrackerEntry. This eliminates the evicted flag and fallback path from bound instruments, and removes the second trackers_for_collect HashMap entirely.

Key changes:

TrackerEntry now has two status fields:
- has_been_updated: AtomicBool — set on every measure()/bound.add() call, checked and reset during collect. Only entries updated since last collect are exported. This implements the "no updates = no export" semantic.
- bound_count: AtomicUsize — incremented on bind(), decremented on Drop of the bound handle. Entries with bound_count > 0 are never evicted from the map, even if they had no updates in a cycle.

collect_and_reset() (sync instruments, delta temporality):
- Iterates trackers under a read lock (no write lock needed in steady state)
- Exports only entries where has_been_updated was true, resets the flag
- Evicts stale unbound entries (has_been_updated=false, bound_count=0)
- Preserves bound entries and overflow sentinel indefinitely
- TOCTOU re-check under write lock prevents race with concurrent measure()

drain_and_reset() (async instruments like PrecomputedSum):
- Preserved for Observable instruments that need true staleness detection
- Uses std::mem::take instead of the old two-map swap pattern

Bound instrument handles (BoundSumHandle, BoundHistogramHandle):
- Always use the fast path — no eviction check, no fallback
- Set has_been_updated on every call so the tracker is exported at collect
- Decrement bound_count on Drop so the entry can be evicted after unbind

This design was informed by the delta-collect-optimization branch (EXP-002) which validated the has_been_updated + in-place iteration pattern. The bound_count mechanism extends it to prevent eviction of bound entries, addressing the maintainer's feedback that bound instruments should persist until explicitly unbound.

Part of open-telemetry#1374, open-telemetry#2328
Benchmark Snapshot — After Delta Collect Refactor (936993f)
Numbers after replacing swap-and-drain collection with in-place iteration. Hardware: Apple M4 Max (MacBook Pro), macOS 15.4
Key observations:
Speedups vs unbound (after refactor): Counter 28.4x, Histogram 8.5x (up from 5.5x before)
…unbound path

When bind() is called at/over the cardinality limit, return None from ValueMap::bind() instead of binding to the shared overflow tracker.

BoundSumHandle and BoundHistogramHandle now use a Direct/Fallback enum:
- Direct: fast path with dedicated tracker (normal case)
- Fallback: delegates to unbound Measure::call() path, ensuring correct overflow attribution and automatic recovery when delta collect frees space

Also fixes pre-existing counter_aggregation_overflow_delta test failure caused by collect_and_reset's two-phase eviction (stale entries need a second collect cycle to be evicted, unlike the old drain_and_reset).

Adds 4 new tests for overflow and delta interaction scenarios.
Benchmark Snapshot — After Overflow Fallback (7b531c4)
Numbers after adding the overflow fallback. Hardware: Apple M4 Max (MacBook Pro), macOS 15.4
Key observations:
Speedups vs unbound (unchanged): Counter 28.3x, Histogram 8.7x
Add `experimental_metrics_bound_instruments` feature flag to both opentelemetry and opentelemetry-sdk crates.

All bound instrument public API (Counter::bind, Histogram::bind, BoundCounter, BoundHistogram, BoundSyncInstrument) and internal plumbing (BoundMeasure, Measure::bind, ValueMap::bind) are gated behind this flag.

Follows the existing naming convention (experimental_metrics_*). The testing feature enables it by default for test coverage.
… coverage

Add benchmark results as code comments in benches/bound_instruments.rs with Apple M4 Max hardware info, following the project convention.

Add 7 new tests for bound instruments covering:
- Histogram delta temporality with reset verification
- Histogram bound+unbound sharing same data point
- Histogram idle delta cycle (no update = no export)
- Histogram overflow fallback to unbound path
- Counter drop enabling eviction of stale entries
- Multiple bound handles sharing same tracker (ref-counting)
- Counter with empty attributes via bind()
Known limitation:
This is an edge case — binding with empty attributes is unusual, and runs against the point of binding a fixed attribute set. Should we: (a) document this as a known limitation for now, (b) fix it in this PR, or (c) track it as a follow-up issue?
…ported aggregators

Aggregator types that don't support direct binding (ExponentialHistogram, LastValue, PrecomputedSum) now gracefully fall back to the unbound Measure::call() path instead of panicking at runtime. This prevents crashes when bind() is called on instruments with view-remapped aggregations (e.g., a Counter mapped to Base2ExponentialHistogram).

Also adds an otel_debug! log when bind() hits cardinality overflow, and #[must_use] on BoundCounter/BoundHistogram to catch accidental drops.
…d instruments

- Use stable rustfmt import ordering for cfg-gated imports
- Gate otel_debug import behind experimental_metrics_bound_instruments feature
- Fix clippy iter_kv_map warning in trace_asserter (pre-existing)
- Fix clippy collapsible-if and redundant-closure warnings
- Update CHANGELOG.md entries for both opentelemetry and opentelemetry-sdk
    "All overflow-bound measurements should accumulate in overflow bucket"
);

// Cycle 2: bound handles still work after delta collect
This is the only part I am not so sure of. If bind() occurred while we were in overflow, we will always remain in overflow, even after space gets cleaned up. It should not be an issue in practice, as bound instruments have a very specialized use case and binding typically occurs at the very start of the application, so it's unlikely they'd go into overflow.
Leaving this comment for us to revisit in the future; not a blocker for this PR.
Some options to discuss in the future:
- bind() can return Err/None when overflowing, so users know upfront they are not using bind() properly. This return approach hasn't been done in the OTel world so far, but bound is special and might call for exceptions.
- See if we can internally adjust bindings from overflow to normal during a delta collect.
- Exclude bind() from cardinality capping — users leveraging bind() know that no matter what happens elsewhere, the bound instruments always report properly and never overflow.
- Other ideas.
Captured this as a TODO comment directly in the code at the overflow None return in bind_sorted() (b7e9197), linking back to this discussion thread so it doesn't get lost if/when the PR merges.
let mut stale_entries: Vec<Arc<TrackerEntry<A>>> = Vec::new();

{
    let Ok(trackers) = self.trackers.read() else {
The previous design of maintaining 2 trackers allowed us to require a lock only for a single-instruction swap. But now, we need the read() lock held throughout the iteration through the entries, and then the write() lock throughout the stale-entry cleanups.
Given we are not aggressively clearing entries for delta in each collect, this might be less impactful than it appears. The typical pattern is to re-use the majority of trackers, so they don't get evicted, and hence the update/hot path won't need to take a write lock. A new attribute set (or one that went unused and is making a comeback) would suffer slightly, as it competes for the write lock with collect, and the time it waits is directly proportional to the overall count of trackers.
This may be a reasonable trade-off to accept, as the previous situation was definitely bad due to force-clearing after every collect(), which forced a write lock for every attribute set after every collect().
Need some more time to think through this fully, but feel free to comment if my observations are correct.
Your observations are correct. Here's how the locking trade-off breaks down:
Old design (swap-and-drain): Write lock on trackers held only for mem::swap (pointer swap — near instant), then iteration happens on a second map. But every delta collect clears all entries, so the next measure() for every attribute set must take a write lock to re-insert — O(n) write lock acquisitions on the hot path per collection cycle.
New design (in-place iteration): Read lock held throughout iteration, then write lock only if there are stale entries to evict.
Steady state (the common case): Attribute sets are reused across cycles → zero stale entries → no write lock during collect at all. Both measure() and collect_and_reset hold only read locks and run concurrently. This is strictly better than the old design where every attr set had to re-acquire a write lock after every collect.
Edge case — new/returning attrs during collect: As you noted, a measure() that needs to insert a new entry competes for the write lock with the read lock held by collect. Wait time is proportional to total tracker count. But this only affects never-before-seen attribute sets arriving during the iteration window. Once inserted, they're back on the read-lock fast path.
Stale eviction write lock: Duration proportional to the number of stale entries (the retain() call), not total tracker count. In steady state this never fires.
Bound instruments: Completely unaffected — they hold Arc<TrackerEntry> directly, bypassing the map and all locking entirely.
Atomics safety during concurrent read-lock iteration: has_been_updated.swap(false) and the aggregator read/reset use atomics internally (AtomicI64/AtomicU64 for sums, Mutex for histograms). If a concurrent measure() sets has_been_updated back to true after the swap, that measurement is captured in the next cycle — not lost.
Net: the new design trades a brief read-lock hold during collect for eliminating O(n) write-lock acquisitions on the hot path every cycle. The only regression is the narrow window where brand-new attribute sets must wait for collect's read lock to release before inserting.
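The locking scheme above can be sketched with std primitives only. This is a simplified illustration of the read-lock iteration, the stale-candidate pass, and the TOCTOU re-check under the write lock — the names (`Entry`, `collect_and_reset`) mirror the discussion but are not the SDK's real signatures:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, AtomicU64, AtomicUsize, Ordering};
use std::sync::{Arc, RwLock};

struct Entry {
    value: AtomicU64,             // toy aggregator
    has_been_updated: AtomicBool, // set by measure(), reset at collect
    bound_count: AtomicUsize,     // live bound handles
}

// In-place delta collect: read lock while iterating; write lock only
// if there are stale candidates to evict.
fn collect_and_reset(trackers: &RwLock<HashMap<String, Arc<Entry>>>) -> Vec<(String, u64)> {
    let mut exported = Vec::new();
    let mut stale: Vec<String> = Vec::new();
    {
        let map = trackers.read().unwrap(); // read lock only, in steady state
        for (k, e) in map.iter() {
            // swap(false) checks and resets the update flag atomically
            if e.has_been_updated.swap(false, Ordering::AcqRel) {
                exported.push((k.clone(), e.value.swap(0, Ordering::AcqRel)));
            } else if e.bound_count.load(Ordering::Acquire) == 0 {
                stale.push(k.clone()); // candidate: idle and unbound
            }
        }
    }
    if !stale.is_empty() {
        let mut map = trackers.write().unwrap();
        for k in stale {
            // TOCTOU re-check under the write lock: a concurrent
            // measure() may have set the flag again after we sampled it.
            let evict = map
                .get(&k)
                .map_or(false, |e| !e.has_been_updated.load(Ordering::Acquire));
            if evict {
                map.remove(&k);
            }
        }
    }
    exported
}

fn new_entry(v: u64, updated: bool, bound: usize) -> Arc<Entry> {
    Arc::new(Entry {
        value: AtomicU64::new(v),
        has_been_updated: AtomicBool::new(updated),
        bound_count: AtomicUsize::new(bound),
    })
}

fn main() {
    let trackers = RwLock::new(HashMap::new());
    {
        let mut m = trackers.write().unwrap();
        m.insert("hot".to_string(), new_entry(7, true, 0));
        m.insert("idle_bound".to_string(), new_entry(0, false, 1));
        m.insert("idle".to_string(), new_entry(0, false, 0));
    }
    let out = collect_and_reset(&trackers);
    assert_eq!(out, vec![(String::from("hot"), 7)]); // only updated entries export
    let map = trackers.read().unwrap();
    assert!(map.contains_key("idle_bound")); // bound entries persist
    assert!(!map.contains_key("idle"));      // stale unbound entry evicted
}
```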
Thanks for confirming! My conclusion is the same — this is a trade-off, and I think it strikes a reasonable balance.
Briefly chatted with @utpilla on this as well (who has the most experience with all the perf work we have in this area!).
TrackerEntry::has_been_updated now starts as false (no recording yet) instead of true. The measure() paths explicitly set it to true after the first update(). This prevents phantom exports if a tracker is ever created without an immediate recording (e.g., via bind()). Also adds a TODO capturing maintainer feedback about permanent overflow-fallback for bound instruments created at cardinality limit.
Refs #1374, #2328
Summary
Adds BoundCounter and BoundHistogram to the public API behind the experimental_metrics_bound_instruments feature flag. These types cache the resolved aggregator reference for a fixed attribute set, allowing subsequent measurements to bypass per-call sort, dedup, hash, and HashMap lookup entirely.

Architecture
TrackerEntry
- has_been_updated tracks whether any measurement occurred since the last collect.
- bound_count tracks how many live BoundCounter/BoundHistogram handles reference this entry.

Delta collection (collect_and_reset)
In-place iteration under a read lock (no write lock in steady state):
- Export only entries where has_been_updated was true, then reset the flag
- Entries with bound_count > 0 are never evicted — they persist across idle cycles but produce no export when there are no updates
- Stale entries (has_been_updated == false && bound_count == 0) are evicted under a write lock with a TOCTOU re-check

Cardinality overflow handling
ValueMap::bind() returns Option<Arc<TrackerEntry>> — None when at the cardinality limit. Bound handle types use a Direct/Fallback enum:
- Direct: fast path with a dedicated tracker
- Fallback: delegates to the unbound Measure::call() path

This gives correct overflow attribution (same behavior as unbound at overflow) with automatic recovery when delta collect frees cardinality space. An otel_debug! log fires when the fallback occurs.
Aggregator types that don't support direct binding (ExponentialHistogram, LastValue, PrecomputedSum) also use the fallback path gracefully instead of panicking.
Feature flag
All public API, traits, noop impls, and internal plumbing are gated behind experimental_metrics_bound_instruments. The testing feature enables it by default.

Benchmark Results
Apple M4 Max, 16 cores (12 performance + 4 efficiency), macOS 15.4:
EC2 results across 9 instance configurations (c7i/c7a/c7g × large/xlarge/4xlarge) show 5-28x counter speedup and 5-9x histogram speedup depending on architecture. Full results in the expandable sections below.
EC2 Counter Results
EC2 Histogram Results
Previous experiments (EXP-002, EXP-003) and why bound instruments are the right approach
EXP-002: Micro-optimizations on existing architecture
EXP-003: Sharded HashMap with pre-hashing
HashedAttributes::from_raw() on every measure() call is more expensive than the existing two-lookup pattern.
Profiling shows 78% of CPU in HashMap key hashing (52%) and key comparison (26%). The only way to meaningfully improve measure() throughput is to skip the lookup entirely. Bound instruments achieve this by moving the cost to bind-time (once) rather than measure-time (every call).
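The "skip the lookup entirely" insight can be shown with a tiny std-only sketch: the unbound path pays hash + key comparison on every call, while the bound path resolves the counter once and then does a single atomic add per call. The map key here is a plain `String` stand-in for the SDK's attribute-set type:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

fn main() {
    let mut map: HashMap<String, Arc<AtomicU64>> = HashMap::new();
    map.insert("route=/api".to_string(), Arc::new(AtomicU64::new(0)));

    // Unbound path: every call pays hashing + key comparison on the map.
    for _ in 0..1000 {
        map.get("route=/api").unwrap().fetch_add(1, Ordering::Relaxed);
    }

    // Bound path: resolve once at "bind" time, then each call is a
    // single atomic add — no hashing, no lookup, no key comparison.
    let bound = map.get("route=/api").unwrap().clone();
    for _ in 0..1000 {
        bound.fetch_add(1, Ordering::Relaxed);
    }

    assert_eq!(map["route=/api"].load(Ordering::Relaxed), 2000);
}
```

This is the cost-shifting the profiling numbers motivate: the sort/dedup/hash work moves to bind-time (once) instead of measure-time (every call).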
Test Coverage
15 tests covering:
Impact on existing code
Related