Implement cat_ranges to optimize multi-range reads in Zonal bucket (#760)
suni72 wants to merge 13 commits into fsspec:main
Conversation
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
- Fix bug where `start=N, end=N` caused a full file download. - Fix bug where `start=0` or `end=0` triggered a full file download. The condition `if start or end` evaluated to False for 0; changed to explicit `is not None` checks. - Add optimization to return empty bytes immediately for known empty ranges (start >= end).
…emantics - Replace `ValueError` checks with Pythonic clamping logic for out-of-bound ranges (overshoot). - Handle crossover ranges (`start > end`) by returning zero length instead of raising errors. - Ensure negative indices are correctly converted to absolute offsets, clamping to 0 if they exceed file size. - Update test cases to verify valid offset/length calculations for all edge cases
Add checks and tests for 1000 ranges limit on a single mrd request
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #760      +/-   ##
==========================================
+ Coverage   75.45%   76.20%   +0.74%
==========================================
  Files          19       19
  Lines        2889     2980      +91
==========================================
+ Hits         2180     2271      +91
  Misses        709      709
/gcbrun
    and end is not None
    and start >= 0
    and end >= 0
    and start >= end
Do we know any scenarios in which these invalid values would be passed?
This is more of a safety measure to handle all cases the same way Python slices do, since we don't raise errors for invalid inputs. It's just speculation, but when methods like open_parquet dynamically calculate which ranges to read for the requested data, they might compute negative start or end values for edge cases or corrupt files. Code reference for the open_parquet range calculation: https://github.com/fsspec/filesystem_spec/blob/90bcbba391bddef400dde62e03c2eea9a2bdbd3d/fsspec/parquet.py#L173
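To make the slice-like semantics discussed above concrete, here is a minimal sketch of how such a guard could map a possibly negative or out-of-bounds `(start, end)` pair to an absolute `(offset, length)`. The helper name `clamp_range` is hypothetical, for illustration only; it is not the PR's actual API.

```python
def clamp_range(start, end, size):
    """Map (start, end) to an absolute (offset, length), mimicking Python
    slice semantics: negative indices count back from the end, out-of-bounds
    values are clamped, and crossover ranges (start > end) yield zero length
    instead of raising. Hypothetical helper, not the PR's real code."""
    if start is None:
        start = 0
    if end is None:
        end = size
    if start < 0:
        start = max(0, size + start)  # clamp negative overshoot to 0
    if end < 0:
        end = max(0, size + end)
    start = min(start, size)          # clamp positive overshoot to file size
    end = min(end, size)
    return start, max(0, end - start) # crossover -> length 0, no error

clamp_range(0, 10, 100)     # (0, 10)
clamp_range(-150, None, 100)  # whole file: (0, 100)
clamp_range(50, 10, 100)    # crossover: (50, 0)
```

Under these semantics a caller such as open_parquet can pass whatever it computed, and an invalid range simply degrades to an empty read.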
    generation: str
        Object generation.
    supports_append: bool
        If True, allows opening file in append mode. This is generally not supported
Consider making the purpose a bit clearer, e.g. when it should be set.
Updated comment to:
If True, allows opening file in append mode. This is generally not supported
by GCS, but may be supported by subclasses (e.g. ZonalFile). This flag
should be set by subclasses that support append operations. Otherwise,
the mode will be overwritten to "wb" mode with a warning.
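The documented fallback behavior ("overwritten to \"wb\" mode with a warning") could look roughly like the sketch below. The helper name `resolve_open_mode` is hypothetical; in the PR the real logic lives inside `GCSFile`.

```python
import warnings

def resolve_open_mode(mode, supports_append):
    """Hypothetical sketch of the documented behavior: if append mode is
    requested but the file class does not support it, fall back to "wb"
    and warn the caller."""
    if mode == "ab" and not supports_append:
        warnings.warn("Append mode is not supported; falling back to 'wb'.")
        return "wb"
    return mode
```

A subclass like ZonalFile would set `supports_append=True` so `"ab"` passes through unchanged.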
    size = (await self._info(path))["size"] if size is None else size
    offset = size + start
    # If start is negative and larger than the file size, we should start from 0.
    offset = max(0, size + start)
Again, what are the scenarios in which start can be larger than the file size and the user actually wanted start to be 0?
This is to ensure we mimic Python slicing behavior. For example, if a file is 100 bytes long and a user (or a library) requests the last 150 bytes (file[-150:]), standard Python behavior is to cap at the beginning of the file and return the whole 100 bytes rather than throwing an invalid-input error.
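The slicing behavior being mimicked can be verified directly in plain Python, alongside the equivalent absolute-offset calculation from the diff:

```python
data = bytes(range(100))   # a 100-byte "file"

# Requesting the last 150 bytes clamps to the start of the file:
# Python returns the whole file rather than raising.
assert data[-150:] == data
assert len(data[-150:]) == 100

# The equivalent absolute offset, as computed in the PR's clamping line:
size, start = 100, -150
offset = max(0, size + start)  # -> 0
```

So `max(0, size + start)` is just the offset-space translation of what a negative slice index already does.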
    file_size = mrd.persisted_size
    if file_size is None:
        # set file_size here to avoid network call in process_limits_to_offset_and_length
        file_size = (await self._info(f"{bucket}/{object_name}"))["size"]
When do we need this fallback? Are there any scenarios in which this field is not present? My assumption was the field would always be present.
I added this fallback since persisted_size is an optional field and is only set when the stream is opened without a read_handle: https://github.com/googleapis/python-storage/blob/d8dd1e074d2431de9b45e0103181dce749a447a0/google/cloud/storage/asyncio/async_read_object_stream.py#L127. Confirming with Chandra whether the field will always be present.
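The fallback pattern under discussion can be sketched as below; the free function `resolve_file_size` and the fake objects are illustrative stand-ins (assumptions, not the PR's actual code), showing only the "prefer persisted_size, else one metadata call" decision.

```python
import asyncio

async def resolve_file_size(fs, mrd, bucket, object_name):
    # persisted_size is optional on the multi-range downloader: it is only
    # populated when the stream is opened without a read_handle, so fall
    # back to an object-metadata lookup (one extra network call) if absent.
    file_size = mrd.persisted_size
    if file_size is None:
        file_size = (await fs._info(f"{bucket}/{object_name}"))["size"]
    return file_size

# Minimal fakes to exercise both paths:
class FakeMRD:
    def __init__(self, persisted_size=None):
        self.persisted_size = persisted_size

class FakeFS:
    async def _info(self, path):
        return {"size": 1234}

asyncio.run(resolve_file_size(FakeFS(), FakeMRD(), "bkt", "obj"))    # falls back to _info
asyncio.run(resolve_file_size(FakeFS(), FakeMRD(99), "bkt", "obj"))  # uses persisted_size
```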
    for idx, val in batch_res:
        results[idx] = val

    return results
The method is too long; consider extracting smaller, modular helpers.
Added a helper method for creating the async tasks for zonal and regional batches
gcsfs/extended_gcsfs.py (Outdated)
    # Zonal returns a zip/list of multiple items.
    # Regional (super._cat_file) returns a single bytes object.
    # We normalize everything to: [(index, data), (index, data), ...]
    if isinstance(result, bytes):
This check seems a bit fragile since it is based on return types; consider a wrapper method with a more dependable check of whether the bucket is regional or zonal.
Added a wrapper for normalizing the results
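The shape of such a normalizing wrapper, per the follow-up commit ("wrapper for regional tasks to match zonal task return type (idx, data)"), could be sketched as follows. `fetch_regional_as_pairs` and `fake_cat_file` are hypothetical names for illustration, not the PR's actual functions.

```python
import asyncio

async def fetch_regional_as_pairs(fetch_one, idx, path, start, end):
    """Hypothetical wrapper: a regional single-range fetch returns a bare
    bytes object, while a zonal batch returns [(index, data), ...]; wrap
    the regional call so both paths produce the same shape."""
    data = await fetch_one(path, start, end)
    return [(idx, data)]

async def fake_cat_file(path, start, end):
    # Stand-in for a regional _cat_file call returning raw bytes.
    return b"0123456789"[start:end]

pairs = asyncio.run(fetch_regional_as_pairs(fake_cat_file, 3, "bkt/obj", 2, 5))
# pairs == [(3, b"234")]
```

With every task yielding `(index, data)` pairs, the caller can gather mixed zonal/regional results and slot each one into place without type sniffing.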
add wrapper for regional tasks to match zonal task return type (idx, data)
/gcbrun
Key Changes

- Implemented `_cat_ranges` in ExtendedGcsFileSystem to group and batch multi-range requests based on bucket type and use one mrd per unique zonal object path.
- Added a `download_ranges` utility in `zb_hns_utils.py` to encapsulate batched download logic and enforce the `MRD_MAX_RANGES` limit in `AsyncMultiRangeDownloader`.
- Updated `_process_limits_to_offset_and_length` and `core._cat_file` to align with standard Python slicing behavior, correctly handling negative indices, zero-length, and invalid ranges without throwing errors.
- Refactored `fetch_range_split` to use the new `download_ranges` utility method for improved readability, and added documentation for the `supports_append` argument in `GCSFile`.
- Added tests for `_cat_ranges`, sync `cat_ranges`, `_fetch_zonal_batch`, `_group_requests_by_bucket_type`, and `download_ranges`, including specific validation for negative batch sizes and handling of mixed Zonal/Regional buckets in `_cat_ranges`.
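The batching described above (one mrd request must carry at most `MRD_MAX_RANGES` ranges, per the 1000-range limit tested earlier in the thread) reduces to splitting the range list into fixed-size chunks. A minimal sketch, assuming the 1000 limit and a simplified `(offset, length)` representation; `chunk_ranges` is an illustrative helper, not the PR's `download_ranges` signature:

```python
MRD_MAX_RANGES = 1000  # assumed limit, per the 1000-ranges check in the PR

def chunk_ranges(ranges, batch_size=MRD_MAX_RANGES):
    """Split a list of (offset, length) ranges into batches of at most
    batch_size, so each multi-range download request stays under the limit.
    Rejects non-positive batch sizes, mirroring the PR's negative-batch-size
    validation."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [ranges[i:i + batch_size] for i in range(0, len(ranges), batch_size)]

batches = chunk_ranges([(i, 1) for i in range(2500)])
# [len(b) for b in batches] -> [1000, 1000, 500]
```

Each resulting batch can then be handed to one mrd request for its zonal object path.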