
Implement cat_ranges to optimize multi-range reads in Zonal bucket#760

Open
suni72 wants to merge 13 commits into fsspec:main from ankitaluthra1:cat_ranges-2

Conversation


@suni72 suni72 commented Feb 13, 2026

Key Changes

  • Optimized Multi-Range Reads: Implemented _cat_ranges in ExtendedGcsFileSystem to group and batch multi-range requests based on bucket type and use one mrd per unique zonal object path.
  • Utility: Added download_ranges utility in zb_hns_utils.py to encapsulate batched download logic and enforce the MRD_MAX_RANGES limit in AsyncMultiRangeDownloader.
  • Refined Slicing Semantics: Updated _process_limits_to_offset_and_length and core._cat_file to align with standard Python slicing behavior, correctly handling negative indices, zero-length, and invalid ranges without throwing errors.
  • Code Refactoring: Updated fetch_range_split to use new download_ranges utility method for improved readability and added documentation for the supports_append argument in GCSFile.
  • Testing & Validation: Added unit and integration tests for _cat_ranges, sync cat_ranges, _fetch_zonal_batch, _group_requests_by_bucket_type, and download_ranges, including specific validation for negative batch sizes and handling of mixed Zonal/Regional buckets in _cat_ranges.
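The grouping step described in the first bullet can be sketched roughly as below. This is a hypothetical illustration, not the PR's actual code: the function name, the `is_zonal_bucket` predicate, and the request tuple shapes are all assumptions.

```python
# Hypothetical sketch of grouping multi-range requests by bucket type.
# Zonal requests are keyed by object path so one multi-range downloader
# (MRD) can serve every range of a given object; regional requests fall
# through to the ordinary per-range read path.
from collections import defaultdict


def group_requests_by_bucket_type(paths, starts, ends, is_zonal_bucket):
    """Split (index, path, start, end) requests into zonal and regional groups.

    Returns (zonal, regional) where zonal maps path -> [(idx, start, end), ...]
    and regional is a flat list of (idx, path, start, end). The original index
    is carried along so results can be reassembled in request order.
    """
    zonal = defaultdict(list)
    regional = []
    for idx, (path, start, end) in enumerate(zip(paths, starts, ends)):
        if is_zonal_bucket(path):
            zonal[path].append((idx, start, end))
        else:
            regional.append((idx, path, start, end))
    return zonal, regional
```

Carrying the original index through both paths is what lets the final results list be filled in positionally, regardless of which batch finished first.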

suni72 and others added 10 commits February 13, 2026 07:38
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
- Fix bug where `start=N, end=N` caused a full file download.
- Fix bug where `start=0` or `end=0` triggered a full file download. The condition `if start or end` evaluated to False for 0; changed to explicit `is not None` checks.
- Add optimization to return empty bytes immediately for known empty ranges (start >= end).
…emantics

- Replace `ValueError` checks with Pythonic clamping logic for out-of-bound ranges (overshoot).
- Handle crossover ranges (`start > end`) by returning zero length instead of raising errors.

- Ensure negative indices are correctly converted to absolute offsets, clamping to 0 if they exceed file size.

- Update test cases to verify valid offset/length calculations for all edge cases
Add checks and tests for 1000 ranges limit on a single mrd request
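Enforcing the per-request range cap mentioned in this commit might look like the following sketch. The constant value and helper name are assumptions taken from the commit message, not the PR's actual implementation; the negative-batch-size guard mirrors the validation called out in the testing bullet above.

```python
# Hypothetical batching helper: split a list of ranges into chunks no
# larger than the MRD per-request limit (assumed to be 1000 per the
# commit message), rejecting non-positive batch sizes up front.
MRD_MAX_RANGES = 1000


def chunk_ranges(ranges, max_ranges=MRD_MAX_RANGES):
    """Yield successive batches of at most max_ranges range tuples."""
    if max_ranges <= 0:
        raise ValueError("max_ranges must be a positive integer")
    for i in range(0, len(ranges), max_ranges):
        yield ranges[i:i + max_ranges]
```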

codecov bot commented Feb 13, 2026

Codecov Report

❌ Patch coverage is 87.96296% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.20%. Comparing base (1422085) to head (7a5afa7).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
gcsfs/extended_gcsfs.py 86.81% 12 Missing ⚠️
gcsfs/core.py 66.66% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #760      +/-   ##
==========================================
+ Coverage   75.45%   76.20%   +0.74%     
==========================================
  Files          19       19              
  Lines        2889     2980      +91     
==========================================
+ Hits         2180     2271      +91     
  Misses        709      709              

☔ View full report in Codecov by Sentry.

@ankitaluthra1
Collaborator

/gcbrun

@suni72 suni72 marked this pull request as ready for review February 16, 2026 18:39
and end is not None
and start >= 0
and end >= 0
and start >= end
Collaborator

Do we know of any scenarios in which these invalid values would be passed?

Contributor Author

This is more of a safety measure: it handles all cases the same way Python slices do, since we don't raise errors for invalid inputs. It's speculative, but when methods like open_parquet dynamically calculate which ranges to read for the requested data, they might compute negative start or end values for edge cases or corrupt files. Code ref for the open_parquet range calculation: https://github.com/fsspec/filesystem_spec/blob/90bcbba391bddef400dde62e03c2eea9a2bdbd3d/fsspec/parquet.py#L173
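The slice-like behavior discussed here could be sketched as follows. This is a simplified illustration under assumed semantics, not the PR's `_process_limits_to_offset_and_length` itself, and the function name is hypothetical.

```python
# Hypothetical sketch of Python-slice-style range normalization:
# negative indices count from the end, out-of-bound values are clamped
# to [0, size], and crossover ranges (start > end) yield length 0
# instead of raising an error.
def to_offset_and_length(start, end, size):
    """Convert a (start, end) byte range into a (offset, length) pair."""
    if start is None:
        start = 0
    if end is None:
        end = size
    if start < 0:
        start = max(0, size + start)  # clamp overshooting negative start to 0
    if end < 0:
        end = max(0, size + end)
    start = min(start, size)          # clamp overshoot past EOF
    end = min(end, size)
    return start, max(0, end - start)  # crossover -> zero-length read
```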

generation: str
Object generation.
supports_append: bool
If True, allows opening file in append mode. This is generally not supported
Collaborator

Consider making the purpose a bit clearer, e.g. when it should be set.

Contributor Author

Updated comment to:

If True, allows opening file in append mode. This is generally not supported
            by GCS, but may be supported by subclasses (e.g. ZonalFile). This flag
            should be set by subclasses that support append operations. Otherwise,
            the mode will be overwritten to "wb" mode with a warning.

size = (await self._info(path))["size"] if size is None else size
offset = size + start
# If start is negative and larger than the file size, we should start from 0.
offset = max(0, size + start)
Collaborator

Again, what are the scenarios in which start can be larger than the file size while the user actually wanted start to be 0?

Contributor Author

This is to ensure we mimic python slicing behavior. For example, if a file is 100 bytes long, and a user (or a library) requests the last 150 bytes (file[-150:]), standard Python behavior is to cap at the beginning of the file and return the whole 100 bytes, rather than throwing an invalid input error.
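The Python behavior being mimicked is easy to demonstrate directly on a bytes object standing in for the 100-byte file:

```python
# Standard Python slicing clamps out-of-bound requests rather than erroring.
data = bytes(100)                 # stand-in for a 100-byte object

assert len(data[-150:]) == 100    # "last 150 bytes" is capped at the file start
assert data[150:] == b""          # start past EOF yields empty bytes, no error
```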

file_size = mrd.persisted_size
if file_size is None:
# set file_size here to avoid network call in process_limits_to_offset_and_length
file_size = (await self._info(f"{bucket}/{object_name}"))["size"]
Collaborator

When do we need this fallback? Are there any scenarios in which this field is not present? My assumption was that the field would always be present.

Contributor Author

I added this fallback since persisted_size is an optional field and is only set when the stream is opened without a read_handle: https://github.com/googleapis/python-storage/blob/d8dd1e074d2431de9b45e0103181dce749a447a0/google/cloud/storage/asyncio/async_read_object_stream.py#L127. I'm confirming with Chandra whether the field will always be present.

for idx, val in batch_res:
results[idx] = val

return results
Collaborator

The method is too long; consider extracting smaller, modular helpers.

Contributor Author

Added a helper method for creating the async tasks for zonal and regional batches

# Zonal returns a zip/list of multiple items.
# Regional (super._cat_file) returns a single bytes object.
# We normalize everything to: [(index, data), (index, data)...]
if isinstance(result, bytes):
Collaborator

@ankitaluthra1 ankitaluthra1 Feb 20, 2026

This check seems a bit fragile since it dispatches on return types; consider a wrapper method with a more dependable check of whether the bucket is regional or zonal.

Contributor Author

Added a wrapper for normalizing the results
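A minimal sketch of such a normalizing wrapper is shown below. It is hypothetical: the wrapper name and call shape are assumptions, illustrating only the idea of tagging the regional single-bytes result with its request index so it matches the `(index, data)` tuples from the zonal batch path.

```python
# Hypothetical normalization wrapper: regional reads (e.g. via
# super()._cat_file) resolve to a single bytes payload, so we attach the
# original request index and wrap it in a list to match the
# [(index, data), ...] shape produced by the zonal batch path.
async def regional_task_wrapper(idx, cat_file_coro):
    """Await a regional single-range read and return [(idx, data)]."""
    data = await cat_file_coro
    return [(idx, data)]
```

With every task returning the same shape, the caller can flatten all batch results and fill `results[idx] = val` without any isinstance checks.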

add wrapper for regional tasks to match zonal task return type (idx, data)
@ankitaluthra1
Collaborator

/gcbrun
