Conversation
Signed-off-by: seungjong bae <bcj0114@gmail.com>
Thanks for the suggestion, but I don't think the goal of this PR aligns with the current arrangement. This seems to contribute an alternative implementation of the existing ConcurrentLruCache. The main goal of Spring Framework's [...] In general, if you need a more specialized version of this cache, please contribute it to a separate library, or better, use Caffeine, as this project will provide you with optimal implementations depending on your use case. Thanks!
Thank you for taking the time to leave a review and provide feedback. Thank you very much!
PR Draft: ConcurrentLruCache2
What / Why
ConcurrentLruCache2Benchmark (Throughput, Threads=8, capacity=100, missRate=0.1): ConcurrentLruCache 136,826 ops/s → ConcurrentLruCache2 1,237,818 ops/s (1,237,818 / 136,826 ≈ 9.05× throughput improvement).
ConcurrentLruCache2: a performance-oriented LRU alternative that reduces read/write contention via wider striping (next power-of-two of availableProcessors), padded per-stripe counters, and a pending-based drain strategy.

Key changes
- Read path (ReadOperations): widen striping to the next power-of-two of availableProcessors, removing the previous max=4 cap.
- Replace the AtomicLongArray-based counters with per-stripe counter objects, and apply padding (PaddedAtomicLong/PaddedLong) to mitigate false sharing on hot counter-update paths.
- Write path (WriteOperations): get returns null on a miss (no automatic loader); callers populate via put.
- Add setEvictionListener, receiving Entry(key, value) on eviction (default: no-op). A sketch of this surface follows the list.
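For orientation, the sketch below outlines the surface described in the bullets above. Only get returning null on a miss, a public put, setEvictionListener, and the Entry(key, value) callback come from this PR's description; the constructor, the listener type, and all method bodies are illustrative assumptions, not the actual implementation.

```java
import java.util.function.Consumer;

// Illustrative surface only; striping, buffers, and drain logic are omitted.
public class ConcurrentLruCache2<K, V> {

    /** Key/value pair handed to the eviction listener. */
    public record Entry<K, V>(K key, V value) {}

    // Default listener is a no-op, as described in the PR.
    private Consumer<Entry<K, V>> evictionListener = entry -> { };

    public ConcurrentLruCache2(int capacity) {
        // allocate buckets, read buffers, per-stripe counters, ... (assumed constructor)
    }

    /** Returns the cached value, or null on a miss; there is no automatic loader. */
    public V get(K key) {
        return null; // placeholder body
    }

    /** Public so callers decide whether and when to populate the cache. */
    public void put(K key, V value) {
        // insert; on eviction, notify the listener with Entry(key, value)
    }

    public void setEvictionListener(Consumer<Entry<K, V>> listener) {
        this.evictionListener = listener;
    }
}
```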
Performance bottlenecks (existing) and improvements (this PR)

Bottleneck: AtomicLongArray-based counter updates
The existing cache uses an AtomicLongArray for recordedCount/processedCount. Because it is backed by a contiguous primitive array, hot per-stripe counter updates may pay extra overhead and can be more sensitive to cache-line interactions.
Improvement: AtomicLongArray → per-stripe counter objects
ConcurrentLruCache2 uses per-stripe counter objects (AtomicLong[]) instead of AtomicLongArray, reducing dependence on a contiguous primitive array layout and aiming to lower the cost of contended updates.
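To make the layout difference concrete, here is a simplified contrast between the two counter shapes; class and field names are illustrative, not the PR's actual fields:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

// Existing shape: one contiguous primitive array shared by all stripes.
class ContiguousCounters {
    final AtomicLongArray recordedCount;

    ContiguousCounters(int stripes) {
        this.recordedCount = new AtomicLongArray(stripes);
    }

    void record(int stripe) {
        // Adjacent slots live in the same long[] and may share a cache line.
        recordedCount.incrementAndGet(stripe);
    }
}

// PR shape: one counter object per stripe, allocated independently.
class PerStripeCounters {
    final AtomicLong[] recordedCount;

    PerStripeCounters(int stripes) {
        this.recordedCount = new AtomicLong[stripes];
        for (int i = 0; i < stripes; i++) {
            this.recordedCount[i] = new AtomicLong();
        }
    }

    void record(int stripe) {
        // Separate objects; usually, but not always, on separate cache lines.
        recordedCount[stripe].incrementAndGet();
    }
}
```

Separate AtomicLong instances can still end up adjacent on the heap, which is why the padding described below is applied on top of this change.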
Bottleneck: false sharing from contiguous counter layout
In ConcurrentLruCache, ReadOperations tracks per-stripe progress via recordedCount/processedCount/readCount. When these are stored in contiguous primitive arrays, adjacent stripes can share cache lines, producing cache-line bouncing (false sharing), commonly reflected as higher backend stalls and CPI.
Validation: macOS CPU Performance Counters
Metrics (short notes)
CPI (Cycles per Instruction): average cycles per retired instruction; tends to increase with waiting/coordination overhead.
ARM_STALL_BACKEND: cycles where the pipeline backend is stalled; can increase with coherence/ownership waits.
ARM_STALL_BACKEND / Cycles: fraction of total cycles spent stalled in the backend.
ARM_L1D_CACHE_REFILL: number of L1D cache refills; churn can increase with invalidation/refill activity.
Observation: after padding, CPI, ARM_STALL_BACKEND/Cycles, and ARM_L1D_CACHE_REFILL/Instructions decreased, which is
consistent with reduced cache-line interference on the hot path.
Improvement: padded counters to mitigate false sharing
Change the AtomicLongArray/long[] usage to per-stripe padded objects (PaddedAtomicLong, PaddedLong) to reduce cache-line collisions between frequently-updated counters, targeting lower stalls on the recordRead and drain-check paths.
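A common way to build a PaddedAtomicLong-style counter is to surround the hot value with otherwise unused long fields so that two counters rarely share a 64-byte cache line. The sketch below shows that general technique; it is not necessarily the PR's exact layout (the JDK-internal @Contended annotation achieves the same effect but requires extra JVM flags).

```java
import java.util.concurrent.atomic.AtomicLong;

// Pads the inherited value field so that counters updated by different threads
// are unlikely to land on the same cache line. Field order is up to the JVM,
// which is why real implementations often pad on both sides.
class PaddedAtomicLong extends AtomicLong {

    // 7 * 8 bytes of trailing padding.
    long p1, p2, p3, p4, p5, p6, p7;

    PaddedAtomicLong() {
        super(0L);
    }

    long sumPadding() {
        // Touching the padding fields discourages tools from treating them as dead.
        return p1 + p2 + p3 + p4 + p5 + p6 + p7;
    }
}
```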
Bottleneck: limited striping in ReadOperations
ReadOperations uses min(4, nextPowerOfTwo(availableProcessors)) (i.e., at most 4 stripes), increasing the chance of multiple threads sharing the same buffers/counters under higher thread counts.
Improvement: expand ReadOperations striping
ConcurrentLruCache2 sets the number of buffers to the next power-of-two of availableProcessors (removing the max=4 cap), spreading threads across more stripes and reducing contention on the record/drain-check paths.
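The sizing change itself is small. A sketch, assuming a ceilingNextPowerOfTwo-style helper like the one the existing cache relies on:

```java
final class StripeSizing {

    // Existing ReadOperations: at most 4 read buffers, regardless of core count.
    static int cappedBufferCount() {
        return Math.min(4, ceilingNextPowerOfTwo(Runtime.getRuntime().availableProcessors()));
    }

    // ConcurrentLruCache2 as described in this PR: the stripe count grows with
    // the machine, e.g. 16 buffers on a 16-core host instead of 4.
    static int widenedBufferCount() {
        return ceilingNextPowerOfTwo(Runtime.getRuntime().availableProcessors());
    }

    // Rounds up to the next power of two (returns 1 for inputs <= 1).
    static int ceilingNextPowerOfTwo(int x) {
        return 1 << (32 - Integer.numberOfLeadingZeros(x - 1));
    }
}
```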
Bottleneck: drains attempted on every write
ConcurrentLruCache sets drainStatus = REQUIRED and attempts a drain on each write (e.g. put), which can lead to frequent drain attempts and lock contention during write bursts.
Improvement: pending-based drain (WriteOperations)
ConcurrentLruCache2 tracks pending write tasks; when the pending count is below a threshold, drains can be deferred to avoid unnecessary drain attempts.
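A minimal sketch of the pending-based idea; the threshold, queue type, and field names are assumptions rather than the PR's actual code. A real implementation would also drain from the read path so deferred work is eventually applied.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Writes are buffered as tasks; the eviction lock is only contended once enough
// work has accumulated, instead of on every put.
class PendingDrain {

    private static final int DRAIN_THRESHOLD = 16; // illustrative value

    private final Queue<Runnable> pendingWrites = new ConcurrentLinkedQueue<>();
    private final AtomicInteger pending = new AtomicInteger();
    private final ReentrantLock evictionLock = new ReentrantLock();

    void afterWrite(Runnable task) {
        pendingWrites.add(task);
        if (pending.incrementAndGet() < DRAIN_THRESHOLD) {
            return; // defer: not enough pending work to justify taking the lock
        }
        if (evictionLock.tryLock()) {
            try {
                drain();
            }
            finally {
                evictionLock.unlock();
            }
        }
    }

    private void drain() {
        Runnable task;
        while ((task = pendingWrites.poll()) != null) {
            pending.decrementAndGet();
            task.run(); // apply the buffered write to the LRU ordering, evicting if needed
        }
    }
}
```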
Compatibility and migration
- ConcurrentLruCache2 is an additional implementation with a different operational model; it does not replace ConcurrentLruCache.
- ConcurrentLruCache: a miss triggers the generator automatically (generate + populate).
- ConcurrentLruCache2: manual population (a get miss returns null; the caller decides whether/when to put).
- ConcurrentLruCache2 exposes put as public so callers can control population.
- Callers that need automatic loading can keep ConcurrentLruCache, or call an external loader and then put.
- With a capacity of 0, get always returns null and entries inserted via put are immediately evicted (effectively disabling caching).
- A miss returns null from get; stored values must be non-null.
- The eviction listener receives Entry(key, value) on eviction/removal/clear.
- Guideline: ConcurrentLruCache for auto-loader needs; ConcurrentLruCache2 for manual population + eviction hook + lower contention (see the example after this list).
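As a migration illustration, manual population with ConcurrentLruCache2; this builds on the API sketch above, and loadExpensively is a placeholder for whatever loader the caller already has:

```java
class ManualPopulationExample {

    private final ConcurrentLruCache2<String, byte[]> cache = new ConcurrentLruCache2<>(100);

    ManualPopulationExample() {
        // Eviction hook; Entry accessors follow the record sketched earlier.
        cache.setEvictionListener(entry ->
                System.out.println("evicted " + entry.key()));
    }

    byte[] lookup(String key) {
        byte[] value = cache.get(key);     // miss -> null; nothing is loaded automatically
        if (value == null) {
            value = loadExpensively(key);  // external loader chosen by the caller
            cache.put(key, value);         // the caller decides whether/when to populate
        }
        return value;
    }

    private byte[] loadExpensively(String key) {
        return key.getBytes();             // stand-in for a real load
    }
}
```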
Tests
- ./gradlew :spring-core:test (JDK 25)
- JAVA_HOME=/path/to/jdk25 ./gradlew :spring-core:jmhJar
- $JAVA_HOME/bin/java -jar spring-core/build/libs/*-jmh.jar "org.springframework.util.ConcurrentLruCache2Benchmark.*"
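For orientation, a JMH benchmark in the spirit of the one invoked above might be shaped as follows; the state fields, key distribution, and miss-rate handling are assumptions and may differ from the PR's actual ConcurrentLruCache2Benchmark:

```java
import java.util.concurrent.ThreadLocalRandom;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@Threads(8)
public class ConcurrentLruCache2BenchmarkSketch {

    @Param("100")
    int capacity;

    @Param("0.1")
    double missRate;

    ConcurrentLruCache2<Integer, String> cache;
    int keySpace;

    @Setup
    public void setUp() {
        cache = new ConcurrentLruCache2<>(capacity);
        // A key space larger than the capacity yields roughly the requested miss rate.
        keySpace = (int) (capacity / (1.0 - missRate));
        for (int i = 0; i < capacity; i++) {
            cache.put(i, "v" + i);
        }
    }

    @Benchmark
    public String getWithManualPopulate() {
        int key = ThreadLocalRandom.current().nextInt(keySpace);
        String value = cache.get(key);
        if (value == null) {
            value = "v" + key;        // repopulate on a miss, mirroring caller-side loading
            cache.put(key, value);
        }
        return value;
    }
}
```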