Share sparkey readers with memory-aware heap fallback #5904
Draft
spkrka wants to merge 1 commit into spotify:main from
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #5904 +/- ##
==========================================
+ Coverage 61.54% 61.74% +0.19%
==========================================
Files 317 318 +1
Lines 11653 11711 +58
Branches 822 833 +11
==========================================
+ Hits 7172 7231 +59
+ Misses 4481 4480 -1
…llback

- Shared reader cache: static ConcurrentHashMap deduplicates readers across all DoFn instances on a worker (was: one reader per vCPU thread). Loaded concurrently via a dedicated 4-thread daemon pool.
- HostMemoryTracker: estimates off-heap budget (totalPhysical - maxHeap - 2GB) and heap budget (maxHeap - max(4GB, 10%)). Each shard atomically claims from off-heap first; if exhausted, claims heap and opens with Sparkey.reader().useHeap(true); if neither fits, falls back to mmap with a warning.
- 2-phase shard opening: plan all shards first (claiming budgets), then open heap readers before mmap readers. This ensures heap readers' temporary page cache I/O doesn't evict pages that mmap readers need.
- Bump sparkey 3.5.1 → 3.7.0 for the heap-backed reader API.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
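The budget arithmetic and claim order in the commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's HostMemoryTracker: the class name `MemoryBudgetSketch`, the `Mode` enum, and the method names are hypothetical. It also assumes that "GB" means GiB, that the 10% is taken of the max heap, and that negative budgets are clamped to zero.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of the budget arithmetic and claim order; not the PR's actual class. */
final class MemoryBudgetSketch {
    static final long GiB = 1L << 30;

    /** Which backing a shard ends up with after trying to claim budget. */
    enum Mode { OFF_HEAP, HEAP, UNBUDGETED_MMAP }

    private final AtomicLong offHeapRemaining;
    private final AtomicLong heapRemaining;

    MemoryBudgetSketch(long totalPhysicalBytes, long maxHeapBytes) {
        // Off-heap budget: totalPhysical - maxHeap - 2GB (GB treated as GiB here, an assumption).
        long offHeap = totalPhysicalBytes - maxHeapBytes - 2 * GiB;
        // Heap budget: maxHeap - max(4GB, 10%); the 10% is assumed to be of maxHeap.
        long heap = maxHeapBytes - Math.max(4 * GiB, maxHeapBytes / 10);
        // Clamping at zero is an assumption so small hosts never get a negative budget.
        this.offHeapRemaining = new AtomicLong(Math.max(0L, offHeap));
        this.heapRemaining = new AtomicLong(Math.max(0L, heap));
    }

    /** Atomically subtracts bytes from the budget, or leaves it untouched if it does not fit. */
    private static boolean tryClaim(AtomicLong remaining, long bytes) {
        while (true) {
            long current = remaining.get();
            if (current < bytes) {
                return false;
            }
            if (remaining.compareAndSet(current, current - bytes)) {
                return true;
            }
        }
    }

    /** Claims off-heap first, then heap; if neither fits, the shard is mmapped without a budget. */
    Mode claim(long shardBytes) {
        if (tryClaim(offHeapRemaining, shardBytes)) {
            return Mode.OFF_HEAP;
        }
        if (tryClaim(heapRemaining, shardBytes)) {
            return Mode.HEAP;
        }
        return Mode.UNBUDGETED_MMAP;
    }
}
```

Under these assumptions, a host with 64 GiB of physical memory and a 16 GiB max heap gets a 46 GiB off-heap budget and a 12 GiB heap budget. The compare-and-set loop keeps each claim all-or-nothing, so concurrent shard loads cannot oversubscribe either budget.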
2637a24 to
455814f
Summary
Alternative approach to #5903 for shared sparkey readers with memory-aware heap fallback. Both PRs share the same core features (reader cache, HostMemoryTracker, heap fallback). This PR adds a cleaner architecture on top:
- 2-phase shard opening: plan all shards first (claiming budgets atomically), then open heap readers before mmap readers. This ensures heap readers' temporary page cache I/O (during file read into byte[]) doesn't evict pages that mmap readers need.
- ShardPlan case class: each shard's plan (index, file, size, useHeap, budgeted) is a first-class value. Summary stats derived via .filter().map().sum.
- Single summary log per load: instead of per-shard log lines, one INFO/WARN line per sparkey load showing GiB breakdown by mode (e.g. Loaded 256 sparkey shard(s) from gs://bucket/path/*: 150.32 GiB off-heap, 49.68 GiB on heap).
- planShard/openReader helpers: deduplicates logic between sharded and unsharded paths.

Comparison with #5903

Both are viable: #5903 is simpler, #5904 is more polished with better page cache behavior for mixed heap+mmap workloads.
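The plan-then-open ordering and the single summary line from the Summary can be sketched as below. This is written in Java for illustration only; the PR itself describes a case class. The `ShardPlan` fields follow the description (index, file, size, useHeap, budgeted), but the class `ShardPlanSketch` and the methods `openOrder` and `summary` are hypothetical names. The sketch also assumes "off-heap" means budgeted mmap shards, and it leaves out the unbudgeted mmap case because the source does not show how that is worded.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of 2-phase shard opening; names are illustrative, not the PR's code. */
final class ShardPlanSketch {
    private static final double GIB = 1L << 30;

    /** One shard's plan as a plain value, decided in phase 1 before any file is opened. */
    static final class ShardPlan {
        final int index;
        final String file;
        final long sizeBytes;
        final boolean useHeap;
        final boolean budgeted;

        ShardPlan(int index, String file, long sizeBytes, boolean useHeap, boolean budgeted) {
            this.index = index;
            this.file = file;
            this.sizeBytes = sizeBytes;
            this.useHeap = useHeap;
            this.budgeted = budgeted;
        }
    }

    /** Phase 2 ordering: heap-backed shards first, then mmap shards, keeping the planned order within each group. */
    static List<ShardPlan> openOrder(List<ShardPlan> plans) {
        List<ShardPlan> ordered = new ArrayList<>(plans);
        // Boolean false sorts before true, so keying on !useHeap puts heap plans first.
        // List.sort is stable, so shards keep their relative order inside each group.
        ordered.sort(Comparator.comparing((ShardPlan p) -> !p.useHeap));
        return ordered;
    }

    /** Builds one summary line per load from the plans, instead of one log line per shard. */
    static String summary(String path, List<ShardPlan> plans) {
        long offHeapBytes = plans.stream()
                .filter(p -> !p.useHeap && p.budgeted)
                .mapToLong(p -> p.sizeBytes)
                .sum();
        long heapBytes = plans.stream()
                .filter(p -> p.useHeap)
                .mapToLong(p -> p.sizeBytes)
                .sum();
        return String.format(Locale.ROOT,
                "Loaded %d sparkey shard(s) from %s: %.2f GiB off-heap, %.2f GiB on heap",
                plans.size(), path, offHeapBytes / GIB, heapBytes / GIB);
    }
}
```

Because every plan is fixed before anything is opened, the open order and the summary are both pure functions of the plan list. That is what makes it possible to read heap shards into byte[] first and only then map the remaining shards.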
Test plan
- HostMemoryTracker (budget claiming, independence, edge cases)
- OpenWithMemoryTracking (mmap, heap, unbounded fallback modes)

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>