[OTEL Reinstrumentation] [RelativeLoadBalancerStrategy_Sensor, DegraderLoadBalancerStrategyV3_Sensor]#1174

Open
aadityaraj7769 wants to merge 1 commit into
linkedin:masterfrom
aadityaraj7769:adiraj/relative-degrader-loadbalancer-otel-migration
@aadityaraj7769
Contributor

Summary

This PR adds OpenTelemetry (OTel) metrics instrumentation to two D2 load balancer strategies: RelativeLoadBalancerStrategy and DegraderLoadBalancerStrategyV3.

Changes

  1. New Interfaces: Added two OTel metrics provider interfaces for collecting load balancer metrics via OpenTelemetry:
  • RelativeLoadBalancerStrategyOtelMetricsProvider: for the relative load balancer; extends the shared base and adds updateTotalHostsInAllPartitionsCount, updateUnhealthyHostsCount, and updateQuarantineHostsCount. The “degraded” population is reported through two separate calls, unhealthy and quarantine, matching JMX getUnhealthyHostsCount / getQuarantineHostsCount.
  • DegraderLoadBalancerStrategyV3OtelMetricsProvider: for the degrader load balancer
  2. No-op Implementations: Added NoOpRelativeLoadBalancerStrategyOtelMetricsProvider and NoOpDegraderLoadBalancerStrategyV3OtelMetricsProvider as the default implementations used when metrics are disabled

  3. Integration:

  • RelativeLoadBalancerStrategy: Integrated the metrics provider into the StateUpdater constructor via dependency injection. Per-call host latency is emitted via TrackerClient.setPerCallDurationListener(), which fires on every individual request. Gauge metrics (host counts, quarantine counts, hash ring points) are emitted after every scheduled partition state update via emitOtelMetrics(). Added a constructor overload in RelativeLoadBalancerStrategyFactory to accept a custom provider. StateUpdater invokes updateUnhealthyHostsCount and updateQuarantineHostsCount on each scheduled emit so the OTel surface matches the interface (rest.li exposes two methods rather than a single “degraded count + status” method).
  • DegraderLoadBalancerStrategyV3: Integrated the metrics provider into the DegraderLoadBalancerStrategyV3 constructor via dependency injection. Per-call host latency is emitted via the same TrackerClient.setPerCallDurationListener() mechanism for newly joining tracker clients. Gauge metrics (override cluster drop rate, hash ring points) are emitted after every partition state update via emitOtelMetrics(). Added a constructor overload in DegraderLoadBalancerStrategyFactoryV3 to accept a custom provider. Wired through D2ClientConfig and D2ClientBuilder with a setter (setDegraderLoadBalancerStrategyV3OtelMetricsProvider).
  • All metrics in both sensors are tagged with two dimensions: serviceName and scheme
  4. TrackerClient Extension: Added a setPerCallDurationListener(Consumer) default method to the TrackerClient interface, implemented in TrackerClientImpl to fire on Rest/Stream response callbacks; shared by both sensors.
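
The listener mechanism described in the TrackerClient Extension item can be sketched roughly as follows. This is an illustration only, not the PR's code: `SketchTrackerClient` and `SketchTrackerClientImpl` are hypothetical stand-ins, and the `Consumer<Long>` type parameter (duration in milliseconds) is an assumption, since the description only says `Consumer`.

```java
import java.util.function.Consumer;

// Hypothetical, simplified stand-in for the TrackerClient interface.
interface SketchTrackerClient {
  // Default no-op, so implementations that predate the method keep compiling.
  // Assumption: the listener receives the call duration in milliseconds.
  default void setPerCallDurationListener(Consumer<Long> listener) {
  }
}

// Hypothetical stand-in for TrackerClientImpl.
class SketchTrackerClientImpl implements SketchTrackerClient {
  // volatile: the listener may be set on one thread and read on response threads.
  private volatile Consumer<Long> _listener;

  @Override
  public void setPerCallDurationListener(Consumer<Long> listener) {
    _listener = listener;
  }

  // Stands in for the Rest/Stream response callback.
  void onResponse(long durationMs) {
    Consumer<Long> listener = _listener; // read once to avoid a race with the setter
    if (listener != null) {
      listener.accept(durationMs);
    }
  }
}
```

In this sketch, responses that arrive before a listener is registered are simply not reported, which is one plausible reading of "for newly joining tracker clients".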

Backward Compatibility

  • Fully backward compatible — all existing constructors default to their respective NoOp providers
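
A minimal sketch of the constructor-delegation pattern this relies on, assuming a trimmed-down provider interface. All type names here are hypothetical, and `updatePointsInHashRing` is an invented method name used only for illustration.

```java
// Hypothetical, trimmed-down provider interface.
interface SketchOtelMetricsProvider {
  void updatePointsInHashRing(String serviceName, String scheme, int points);
}

// No-op default: every call is ignored, so callers that never opt in see no behavior change.
class SketchNoOpOtelMetricsProvider implements SketchOtelMetricsProvider {
  @Override
  public void updatePointsInHashRing(String serviceName, String scheme, int points) {
  }
}

// Hypothetical stand-in for a strategy factory.
class SketchStrategyFactory {
  private final SketchOtelMetricsProvider _provider;

  // Existing constructor: signature unchanged, delegates with the no-op provider.
  SketchStrategyFactory() {
    this(new SketchNoOpOtelMetricsProvider());
  }

  // New overload: callers may inject their own provider.
  SketchStrategyFactory(SketchOtelMetricsProvider provider) {
    _provider = provider;
  }

  SketchOtelMetricsProvider getProvider() {
    return _provider;
  }
}
```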

About RelativeLoadBalancerStrategySensor

The RelativeLoadBalancerStrategy Sensor tracks metrics for D2's relative load balancer, which dynamically adjusts host health scores (0.0–1.0) based on per-host call statistics relative to the overall cluster. It monitors host latency distributions, cluster health composition, and hash ring sizing. This sensor enables server-side observability for how traffic is being distributed across hosts in a service cluster.

New OTel Metrics

Metric Naming Pattern: `D2.RelativeLb.<Metric>`

Dimensions

| Attribute Key | Description | Applied To |
| --- | --- | --- |
| D2.Service.Name | The service being load-balanced | All metrics |
| D2.Scheme | Load balancer scheme (e.g., http, https) | All metrics |
| D2.Host.Status | Unhealthy or Quarantine | DegradedHostsCount only |

ExponentialHistogram

  • D2.RelativeLb.HostLatency (ms): Records per-call latency for each host. The histogram stores count, sum, min, max, and per-bucket counts; a metrics backend can derive the average exactly and approximate p50, p90, p99, and standard deviation from the bucketed distribution.
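
As a rough illustration of why values derived from an exponential histogram are approximate, the sketch below maps a positive latency to a base-2 exponential bucket using the logarithm-based mapping from the OpenTelemetry data model (non-negative scale). Every value that lands in the same bucket becomes indistinguishable. The class name is hypothetical, and real SDKs use more careful mappings near bucket boundaries.

```java
// Sketch of base-2 exponential bucketing for positive values at a non-negative scale.
final class ExpBucketSketch {
  private ExpBucketSketch() {
  }

  // Bucket i covers (base^i, base^(i+1)], where base = 2^(2^-scale).
  static int bucketIndex(double value, int scale) {
    double scaleFactor = Math.scalb(1.0, scale) / Math.log(2.0); // 2^scale / ln(2)
    return (int) Math.ceil(Math.log(value) * scaleFactor) - 1;
  }

  // Exclusive lower bound of bucket `index`: base^index = 2^(index / 2^scale).
  static double lowerBound(int index, int scale) {
    return Math.pow(2.0, index / Math.scalb(1.0, scale));
  }
}
```

At scale 0 a 5 ms call and a 7 ms call share the bucket (4, 8]; a higher scale narrows the buckets and tightens the percentile estimate.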

Gauges

  • D2.RelativeLb.AllPartitionHostsCount ({host}) — Total number of hosts across all partitions regardless of health status; the full host population the load balancer is aware of
  • D2.RelativeLb.DegradedHostsCount ({host}) — Number of hosts in a degraded state, grouped by D2.Host.Status:
    • D2.Host.Status = Unhealthy: hosts whose health score has been reduced due to high latency or error rate
    • D2.Host.Status = Quarantine: hosts currently in quarantine pending health check recovery
  • D2.RelativeLb.PointsInHashRing ({point}) — Total number of points in the consistent hash ring, reflecting the effective traffic weight distribution across hosts
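
The host-count gauges could be fed along the lines of the sketch below. The three `update...` method names come from the PR description, but their parameter lists are assumptions, as is the rule that "unhealthy" means a health score below 1.0. `RecordingRelativeLbMetricsProvider` and `GaugeEmitSketch` are hypothetical helpers, not part of the PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Method names from the PR description; the parameter lists are assumed.
interface SketchRelativeLbMetricsProvider {
  void updateTotalHostsInAllPartitionsCount(String serviceName, String scheme, int count);
  void updateUnhealthyHostsCount(String serviceName, String scheme, int count);
  void updateQuarantineHostsCount(String serviceName, String scheme, int count);
}

// Hypothetical recording provider that keeps the last value reported per gauge.
class RecordingRelativeLbMetricsProvider implements SketchRelativeLbMetricsProvider {
  final Map<String, Integer> lastValues = new LinkedHashMap<>();

  @Override
  public void updateTotalHostsInAllPartitionsCount(String serviceName, String scheme, int count) {
    lastValues.put("AllPartitionHostsCount", count);
  }

  @Override
  public void updateUnhealthyHostsCount(String serviceName, String scheme, int count) {
    lastValues.put("DegradedHostsCount/Unhealthy", count);
  }

  @Override
  public void updateQuarantineHostsCount(String serviceName, String scheme, int count) {
    lastValues.put("DegradedHostsCount/Quarantine", count);
  }
}

final class GaugeEmitSketch {
  private GaugeEmitSketch() {
  }

  // healthScores maps host -> score in [0.0, 1.0]; quarantinedHosts lists hosts in quarantine.
  static void emit(SketchRelativeLbMetricsProvider provider, String serviceName, String scheme,
      Map<String, Double> healthScores, Set<String> quarantinedHosts) {
    int unhealthy = 0;
    for (double score : healthScores.values()) {
      if (score < 1.0) { // assumption: any reduced score counts as unhealthy
        unhealthy++;
      }
    }
    provider.updateTotalHostsInAllPartitionsCount(serviceName, scheme, healthScores.size());
    provider.updateUnhealthyHostsCount(serviceName, scheme, unhealthy);
    provider.updateQuarantineHostsCount(serviceName, scheme, quarantinedHosts.size());
  }
}
```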

About DegraderLoadBalancerStrategyV3Sensor

The DegraderLoadBalancerStrategyV3 Sensor tracks metrics for D2's degrader load balancer, which uses degradation-based health tracking to adaptively route traffic. Each host is placed on a consistent hash ring with points proportional to its health — healthy hosts receive more traffic while degraded hosts are shed. When the entire cluster is unhealthy, the strategy applies a cluster-wide override drop rate that proactively drops a fraction of requests before they reach any host, protecting against cascading failures. This sensor enables server-side observability for the degrader's core health signals: host latency distribution, cluster-level drop rates, and hash ring capacity.
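
The two mechanisms in this paragraph, health-proportional ring points and a cluster-wide override drop rate, can be sketched in simplified form. The formulas and names below are illustrative assumptions, not the actual DegraderLoadBalancerStrategyV3 logic.

```java
// Simplified, hypothetical model of the two degrader mechanisms described above.
final class DegraderSketch {
  private DegraderSketch() {
  }

  // Ring points proportional to health: a fully healthy host (1.0) gets pointsPerHost points.
  static int pointsForHost(double healthFraction, int pointsPerHost) {
    return (int) Math.round(healthFraction * pointsPerHost);
  }

  // Drop decision made before any host is chosen; sample is expected to be uniform in [0, 1).
  static boolean shouldDrop(double overrideDropRate, double sample) {
    return sample < overrideDropRate;
  }
}
```

A caller would typically pass `ThreadLocalRandom.current().nextDouble()` as the sample, so that an override drop rate of 0.3 drops roughly 30% of requests.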

New OTel Metrics

Metric Naming Pattern: `D2.DegraderLb.<Metric>`

Dimensions

| Attribute Key | Description | Applied To |
| --- | --- | --- |
| D2.Service.Name | The service being load-balanced | All metrics |
| D2.Scheme | Load balancer scheme (e.g., http, https) | All metrics |

ExponentialHistogram

  • D2.DegraderLb.HostLatency (ms): Records each host's per-call latency. The histogram stores count, sum, min, max, and per-bucket counts; a metrics backend can derive the average exactly and approximate the percentiles (p50, p90, p99) and standard deviation from the bucketed distribution. These derived values map to the degrader's key health signals: LatencyStandardDeviation, MaxLatencyRelativeFactor, and Pct99LatencyRelativeFactor.

Gauges

  • D2.DegraderLb.OverrideClusterDropRate ({ratio}) — Current cluster-level override drop rate in range [0.0, 1.0]; indicates how aggressively the cluster is shedding traffic due to widespread degradation
  • D2.DegraderLb.PointsInHashRing ({point}) — Total number of points across all hosts in the consistent hash ring; a drop in total points signals that hosts are being degraded or quarantined, reducing overall cluster capacity

…duce HostStatus enum for improved health status tracking. Update Degrader and Relative load balancer strategy metrics providers to utilize HostStatus for reporting unhealthy and quarantined hosts. Enhance tests to validate new metrics behavior and ensure proper listener registration for tracker clients. Maintain backward compatibility while improving telemetry accuracy.

Co-authored-by: Cursor <cursoragent@cursor.com>
aadityaraj7769 changed the title from "Add OpenTelemetry metrics support for RelativeLoadBalancerStrategy" to "Add OpenTelemetry metrics support for RelativeLoadBalancerStrategy_Sensor and DegraderLoadBalancerStrategyV3_Sensor" on May 19, 2026
aadityaraj7769 changed the title from "Add OpenTelemetry metrics support for RelativeLoadBalancerStrategy_Sensor and DegraderLoadBalancerStrategyV3_Sensor" to "[OTEL Reinstrumentation] [RelativeLoadBalancerStrategy_Sensor, DegraderLoadBalancerStrategyV3_Sensor]" on May 19, 2026