Conversation
Signed-off-by: seungjong bae <bcj0114@gmail.com>
Thanks for the suggestion, but I don't think the goal of this PR aligns with the current arrangement. This seems to contribute an alternative implementation of the existing ConcurrentLruCache. The main goal of Spring Framework's [...] In general, if you need a more specialized version of this cache, please contribute it to a separate library, or better, use Caffeine, as this project will provide you with optimal implementations depending on your use case. Thanks!
Thank you for taking the time to leave a review and provide feedback. Thank you very much!
PR Draft: ConcurrentLruCache2
What / Why
ConcurrentLruCache2Benchmark (Throughput, Threads=8, capacity=100, missRate=0.1): ConcurrentLruCache 136,826 ops/s → ConcurrentLruCache2 1,237,818 ops/s (1,237,818 / 136,826 ≈ 9.05× throughput improvement).
ConcurrentLruCache2: a performance-oriented LRU alternative that reduces read/write contention via wider striping (next power-of-two of availableProcessors), padded per-stripe counters, and a pending-based drain strategy.

Key changes
- Read path (ReadOperations): widen striping to the next power-of-two of availableProcessors, removing the previous max=4 cap.
- Replace the AtomicLongArray-based counters with per-stripe counter objects, and apply padding (PaddedAtomicLong/PaddedLong) to mitigate false sharing on hot counter-update paths.
- Write path (WriteOperations): get returns null on a miss (no automatic loader); callers populate via put.
- Add setEvictionListener, receiving Entry(key, value) on eviction (default: no-op). A sketch of this surface follows the list.
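For orientation, the sketch below outlines the surface described in the bullets above. Only get returning null on a miss, a public put, setEvictionListener, and the Entry(key, value) callback come from this PR's description; the constructor, the listener type, and all method bodies are illustrative assumptions, not the actual implementation.

```java
import java.util.function.Consumer;

// Illustrative surface only; striping, buffers, and drain logic are omitted.
public class ConcurrentLruCache2<K, V> {

    /** Key/value pair handed to the eviction listener. */
    public record Entry<K, V>(K key, V value) {}

    // Default listener is a no-op, as described in the PR.
    private Consumer<Entry<K, V>> evictionListener = entry -> { };

    public ConcurrentLruCache2(int capacity) {
        // allocate buckets, read buffers, per-stripe counters, ... (assumed constructor)
    }

    /** Returns the cached value, or null on a miss; there is no automatic loader. */
    public V get(K key) {
        return null; // placeholder body
    }

    /** Public so callers decide whether and when to populate the cache. */
    public void put(K key, V value) {
        // insert; on eviction, notify the listener with Entry(key, value)
    }

    public void setEvictionListener(Consumer<Entry<K, V>> listener) {
        this.evictionListener = listener;
    }
}
```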
Performance bottlenecks (existing) and improvements (this PR)

Bottleneck: AtomicLongArray-based counter updates
The existing cache uses an AtomicLongArray for recordedCount/processedCount. Because it is backed by a contiguous primitive array, hot per-stripe counter updates may pay extra overhead and can be more sensitive to cache-line interactions.
Improvement: AtomicLongArray → per-stripe counter objects
ConcurrentLruCache2 uses per-stripe counter objects (AtomicLong[]) instead of AtomicLongArray, reducing dependence on a contiguous primitive array layout and aiming to lower the cost of contended updates.
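To make the layout difference concrete, here is a simplified contrast between the two counter shapes; class and field names are illustrative, not the PR's actual fields:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

// Existing shape: one contiguous primitive array shared by all stripes.
class ContiguousCounters {
    final AtomicLongArray recordedCount;

    ContiguousCounters(int stripes) {
        this.recordedCount = new AtomicLongArray(stripes);
    }

    void record(int stripe) {
        // Adjacent slots live in the same long[] and may share a cache line.
        recordedCount.incrementAndGet(stripe);
    }
}

// PR shape: one counter object per stripe, allocated independently.
class PerStripeCounters {
    final AtomicLong[] recordedCount;

    PerStripeCounters(int stripes) {
        this.recordedCount = new AtomicLong[stripes];
        for (int i = 0; i < stripes; i++) {
            this.recordedCount[i] = new AtomicLong();
        }
    }

    void record(int stripe) {
        // Separate objects; usually, but not always, on separate cache lines.
        recordedCount[stripe].incrementAndGet();
    }
}
```

Separate AtomicLong instances can still end up adjacent on the heap, which is why the padding described below is applied on top of this change.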
Bottleneck: false sharing from contiguous counter layout
In ConcurrentLruCache, ReadOperations tracks per-stripe progress via recordedCount/processedCount/readCount. When these are stored in contiguous primitive arrays, adjacent stripes can share cache lines, producing cache-line bouncing (false sharing), commonly reflected as higher backend stalls and CPI.
Validation: macOS CPU Performance Counters
Metrics (short notes)
CPI (Cycles per Instruction): average cycles per retired instruction; tends to increase with waiting/coordination overhead.
ARM_STALL_BACKEND: cycles where the pipeline backend is stalled; can increase with coherence/ownership waits.
ARM_STALL_BACKEND / Cycles: fraction of total cycles spent stalled in the backend.
ARM_L1D_CACHE_REFILL: number of L1D cache refills; churn can increase with invalidation/refill activity.
Observation: after padding, CPI, ARM_STALL_BACKEND/Cycles, and ARM_L1D_CACHE_REFILL/Instructions decreased, which is
consistent with reduced cache-line interference on the hot path.
Improvement: padded counters to mitigate false sharing
Change the AtomicLongArray/long[] usage to per-stripe padded objects (PaddedAtomicLong, PaddedLong) to reduce cache-line collisions between frequently-updated counters, targeting lower stalls on the recordRead and drain-check paths.
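A common way to build a PaddedAtomicLong-style counter is to surround the hot value with otherwise unused long fields so that two counters rarely share a 64-byte cache line. The sketch below shows that general technique; it is not necessarily the PR's exact layout (the JDK-internal @Contended annotation achieves the same effect but requires extra JVM flags).

```java
import java.util.concurrent.atomic.AtomicLong;

// Pads the inherited value field so that counters updated by different threads
// are unlikely to land on the same cache line. Field order is up to the JVM,
// which is why real implementations often pad on both sides.
class PaddedAtomicLong extends AtomicLong {

    // 7 * 8 bytes of trailing padding.
    long p1, p2, p3, p4, p5, p6, p7;

    PaddedAtomicLong() {
        super(0L);
    }

    long sumPadding() {
        // Touching the padding fields discourages tools from treating them as dead.
        return p1 + p2 + p3 + p4 + p5 + p6 + p7;
    }
}
```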
Bottleneck: limited striping in ReadOperations
ReadOperations uses min(4, nextPowerOfTwo(availableProcessors)) (i.e., at most 4 stripes), increasing the chance of multiple threads sharing the same buffers/counters under higher thread counts.
Improvement: expand ReadOperations striping
ConcurrentLruCache2 sets the number of buffers to the next power-of-two of availableProcessors (removing the max=4 cap), spreading threads across more stripes and reducing contention on the record/drain-check paths.
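The sizing change itself is small. A sketch, assuming a ceilingNextPowerOfTwo-style helper like the one the existing cache relies on:

```java
final class StripeSizing {

    // Existing ReadOperations: at most 4 read buffers, regardless of core count.
    static int cappedBufferCount() {
        return Math.min(4, ceilingNextPowerOfTwo(Runtime.getRuntime().availableProcessors()));
    }

    // ConcurrentLruCache2 as described in this PR: the stripe count grows with
    // the machine, e.g. 16 buffers on a 16-core host instead of 4.
    static int widenedBufferCount() {
        return ceilingNextPowerOfTwo(Runtime.getRuntime().availableProcessors());
    }

    // Rounds up to the next power of two (returns 1 for inputs <= 1).
    static int ceilingNextPowerOfTwo(int x) {
        return 1 << (32 - Integer.numberOfLeadingZeros(x - 1));
    }
}
```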
Bottleneck: drains attempted on every write
ConcurrentLruCache sets drainStatus = REQUIRED and attempts a drain on each write (e.g. put), which can lead to frequent drain attempts and lock contention during write bursts.
Improvement: pending-based drain (WriteOperations)
ConcurrentLruCache2 tracks pending write tasks; when the pending count is below a threshold, drains can be deferred to avoid unnecessary drain attempts.
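A minimal sketch of the pending-based idea; the threshold, queue type, and field names are assumptions rather than the PR's actual code. A real implementation would also drain from the read path so deferred work is eventually applied.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Writes are buffered as tasks; the eviction lock is only contended once enough
// work has accumulated, instead of on every put.
class PendingDrain {

    private static final int DRAIN_THRESHOLD = 16; // illustrative value

    private final Queue<Runnable> pendingWrites = new ConcurrentLinkedQueue<>();
    private final AtomicInteger pending = new AtomicInteger();
    private final ReentrantLock evictionLock = new ReentrantLock();

    void afterWrite(Runnable task) {
        pendingWrites.add(task);
        if (pending.incrementAndGet() < DRAIN_THRESHOLD) {
            return; // defer: not enough pending work to justify taking the lock
        }
        if (evictionLock.tryLock()) {
            try {
                drain();
            }
            finally {
                evictionLock.unlock();
            }
        }
    }

    private void drain() {
        Runnable task;
        while ((task = pendingWrites.poll()) != null) {
            pending.decrementAndGet();
            task.run(); // apply the buffered write to the LRU ordering, evicting if needed
        }
    }
}
```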
Compatibility and migration
- ConcurrentLruCache2 is an additional implementation with a different operational model; it does not replace ConcurrentLruCache.
- ConcurrentLruCache: a miss triggers the generator automatically (generate + populate).
- ConcurrentLruCache2: manual population (a get miss returns null; the caller decides whether/when to put).
- ConcurrentLruCache2 exposes put as public so callers can control population.
- Callers that need automatic loading can keep ConcurrentLruCache, or call an external loader and then put.
- With a capacity of 0, get always returns null and entries inserted via put are immediately evicted (effectively disabling caching).
- A miss returns null from get; stored values must be non-null.
- The eviction listener receives Entry(key, value) on eviction/removal/clear.
- Guideline: ConcurrentLruCache for auto-loader needs; ConcurrentLruCache2 for manual population + eviction hook + lower contention (see the example after this list).
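As a migration illustration, manual population with ConcurrentLruCache2; this builds on the API sketch above, and loadExpensively is a placeholder for whatever loader the caller already has:

```java
class ManualPopulationExample {

    private final ConcurrentLruCache2<String, byte[]> cache = new ConcurrentLruCache2<>(100);

    ManualPopulationExample() {
        // Eviction hook; Entry accessors follow the record sketched earlier.
        cache.setEvictionListener(entry ->
                System.out.println("evicted " + entry.key()));
    }

    byte[] lookup(String key) {
        byte[] value = cache.get(key);     // miss -> null; nothing is loaded automatically
        if (value == null) {
            value = loadExpensively(key);  // external loader chosen by the caller
            cache.put(key, value);         // the caller decides whether/when to populate
        }
        return value;
    }

    private byte[] loadExpensively(String key) {
        return key.getBytes();             // stand-in for a real load
    }
}
```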
Tests
- ./gradlew :spring-core:test (JDK 25)
- JAVA_HOME=/path/to/jdk25 ./gradlew :spring-core:jmhJar
- $JAVA_HOME/bin/java -jar spring-core/build/libs/*-jmh.jar "org.springframework.util.ConcurrentLruCache2Benchmark.*"
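For orientation, a JMH benchmark in the spirit of the one invoked above might be shaped as follows; the state fields, key distribution, and miss-rate handling are assumptions and may differ from the PR's actual ConcurrentLruCache2Benchmark:

```java
import java.util.concurrent.ThreadLocalRandom;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@Threads(8)
public class ConcurrentLruCache2BenchmarkSketch {

    @Param("100")
    int capacity;

    @Param("0.1")
    double missRate;

    ConcurrentLruCache2<Integer, String> cache;
    int keySpace;

    @Setup
    public void setUp() {
        cache = new ConcurrentLruCache2<>(capacity);
        // A key space larger than the capacity yields roughly the requested miss rate.
        keySpace = (int) (capacity / (1.0 - missRate));
        for (int i = 0; i < capacity; i++) {
            cache.put(i, "v" + i);
        }
    }

    @Benchmark
    public String getWithManualPopulate() {
        int key = ThreadLocalRandom.current().nextInt(keySpace);
        String value = cache.get(key);
        if (value == null) {
            value = "v" + key;        // repopulate on a miss, mirroring caller-side loading
            cache.put(key, value);
        }
        return value;
    }
}
```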