Implement cat_ranges to optimize multi-range reads in Zonal bucket (#760)
suni72 wants to merge 13 commits into fsspec:main
Conversation
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
- Fix bug where `start=N, end=N` caused a full file download. - Fix bug where `start=0` or `end=0` triggered a full file download. The condition `if start or end` evaluated to False for 0; changed to explicit `is not None` checks. - Add optimization to return empty bytes immediately for known empty ranges (start >= end).
…emantics - Replace `ValueError` checks with Pythonic clamping logic for out-of-bound ranges (overshoot). - Handle crossover ranges (`start > end`) by returning zero length instead of raising errors. - Ensure negative indices are correctly converted to absolute offsets, clamping to 0 if they exceed file size. - Update test cases to verify valid offset/length calculations for all edge cases
Add checks and tests for 1000 ranges limit on a single mrd request
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #760      +/-   ##
==========================================
+ Coverage   75.45%   76.20%   +0.74%
==========================================
  Files          19       19
  Lines        2889     2980      +91
==========================================
+ Hits         2180     2271      +91
  Misses        709      709
/gcbrun
    and end is not None
    and start >= 0
    and end >= 0
    and start >= end
Do we know any scenarios in which these invalid values would be passed?
This is more of a safety measure to handle all cases the same way Python slices do, since we don't raise errors for invalid inputs. It's just speculation, but when methods like open_parquet dynamically calculate which ranges to read for the requested data, they might compute negative start or end values for edge cases or corrupt files. Code reference for the open_parquet range calculation: https://github.com/fsspec/filesystem_spec/blob/90bcbba391bddef400dde62e03c2eea9a2bdbd3d/fsspec/parquet.py#L173
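To make the slice-like semantics discussed above concrete, here is a minimal sketch of how such a guard could map a possibly negative or out-of-bounds `(start, end)` pair to an absolute `(offset, length)`. The helper name `clamp_range` is hypothetical, for illustration only; it is not the PR's actual API.

```python
def clamp_range(start, end, size):
    """Map (start, end) to an absolute (offset, length), mimicking Python
    slice semantics: negative indices count back from the end, out-of-bounds
    values are clamped, and crossover ranges (start > end) yield zero length
    instead of raising. Hypothetical helper, not the PR's real code."""
    if start is None:
        start = 0
    if end is None:
        end = size
    if start < 0:
        start = max(0, size + start)  # clamp negative overshoot to 0
    if end < 0:
        end = max(0, size + end)
    start = min(start, size)          # clamp positive overshoot to file size
    end = min(end, size)
    return start, max(0, end - start) # crossover -> length 0, no error

clamp_range(0, 10, 100)     # (0, 10)
clamp_range(-150, None, 100)  # whole file: (0, 100)
clamp_range(50, 10, 100)    # crossover: (50, 0)
```

Under these semantics a caller such as open_parquet can pass whatever it computed, and an invalid range simply degrades to an empty read.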
    generation: str
        Object generation.
    supports_append: bool
        If True, allows opening file in append mode. This is generally not supported
Consider making the purpose a bit clearer, e.g. when it should be set.
Updated comment to:
If True, allows opening file in append mode. This is generally not supported
by GCS, but may be supported by subclasses (e.g. ZonalFile). This flag
should be set by subclasses that support append operations. Otherwise,
the mode will be overwritten to "wb" mode with a warning.
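The documented fallback behavior ("overwritten to \"wb\" mode with a warning") could look roughly like the sketch below. The helper name `resolve_open_mode` is hypothetical; in the PR the real logic lives inside `GCSFile`.

```python
import warnings

def resolve_open_mode(mode, supports_append):
    """Hypothetical sketch of the documented behavior: if append mode is
    requested but the file class does not support it, fall back to "wb"
    and warn the caller."""
    if mode == "ab" and not supports_append:
        warnings.warn("Append mode is not supported; falling back to 'wb'.")
        return "wb"
    return mode
```

A subclass like ZonalFile would set `supports_append=True` so `"ab"` passes through unchanged.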
    size = (await self._info(path))["size"] if size is None else size
    offset = size + start
    # If start is negative and larger than the file size, we should start from 0.
    offset = max(0, size + start)
Again, what are the scenarios in which start can be larger than the file size and the user actually wanted start to be 0?
This is to ensure we mimic Python slicing behavior. For example, if a file is 100 bytes long and a user (or a library) requests the last 150 bytes (file[-150:]), standard Python behavior is to cap at the beginning of the file and return the whole 100 bytes rather than throwing an invalid-input error.
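The slicing behavior being mimicked can be verified directly in plain Python, alongside the equivalent absolute-offset calculation from the diff:

```python
data = bytes(range(100))   # a 100-byte "file"

# Requesting the last 150 bytes clamps to the start of the file:
# Python returns the whole file rather than raising.
assert data[-150:] == data
assert len(data[-150:]) == 100

# The equivalent absolute offset, as computed in the PR's clamping line:
size, start = 100, -150
offset = max(0, size + start)  # -> 0
```

So `max(0, size + start)` is just the offset-space translation of what a negative slice index already does.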
    file_size = mrd.persisted_size
    if file_size is None:
        # set file_size here to avoid network call in process_limits_to_offset_and_length
        file_size = (await self._info(f"{bucket}/{object_name}"))["size"]
When do we need this fallback? Are there any scenarios in which this field is not present? My assumption was the field would always be present.
I added this fallback since persisted_size is an optional field and is only set when the stream is opened without a read_handle: https://github.com/googleapis/python-storage/blob/d8dd1e074d2431de9b45e0103181dce749a447a0/google/cloud/storage/asyncio/async_read_object_stream.py#L127. Confirming with Chandra whether the field will always be present.
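The fallback pattern under discussion can be sketched as below; the free function `resolve_file_size` and the fake objects are illustrative stand-ins (assumptions, not the PR's actual code), showing only the "prefer persisted_size, else one metadata call" decision.

```python
import asyncio

async def resolve_file_size(fs, mrd, bucket, object_name):
    # persisted_size is optional on the multi-range downloader: it is only
    # populated when the stream is opened without a read_handle, so fall
    # back to an object-metadata lookup (one extra network call) if absent.
    file_size = mrd.persisted_size
    if file_size is None:
        file_size = (await fs._info(f"{bucket}/{object_name}"))["size"]
    return file_size

# Minimal fakes to exercise both paths:
class FakeMRD:
    def __init__(self, persisted_size=None):
        self.persisted_size = persisted_size

class FakeFS:
    async def _info(self, path):
        return {"size": 1234}

asyncio.run(resolve_file_size(FakeFS(), FakeMRD(), "bkt", "obj"))    # falls back to _info
asyncio.run(resolve_file_size(FakeFS(), FakeMRD(99), "bkt", "obj"))  # uses persisted_size
```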
    for idx, val in batch_res:
        results[idx] = val

    return results
The method is too long; consider extracting smaller, modular helpers.
Added a helper method for creating the async tasks for zonal and regional batches
gcsfs/extended_gcsfs.py (Outdated)
    # Zonal returns a zip/list of multiple items.
    # Regional (super._cat_file) returns a single bytes object.
    # We normalize everything to: [(index, data), (index, data), ...]
    if isinstance(result, bytes):
This check seems a bit fragile since it is based on return types; consider a wrapper method with a more dependable check of whether the bucket is regional or zonal.
Added a wrapper for normalizing the results
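The shape of such a normalizing wrapper, per the follow-up commit ("wrapper for regional tasks to match zonal task return type (idx, data)"), could be sketched as follows. `fetch_regional_as_pairs` and `fake_cat_file` are hypothetical names for illustration, not the PR's actual functions.

```python
import asyncio

async def fetch_regional_as_pairs(fetch_one, idx, path, start, end):
    """Hypothetical wrapper: a regional single-range fetch returns a bare
    bytes object, while a zonal batch returns [(index, data), ...]; wrap
    the regional call so both paths produce the same shape."""
    data = await fetch_one(path, start, end)
    return [(idx, data)]

async def fake_cat_file(path, start, end):
    # Stand-in for a regional _cat_file call returning raw bytes.
    return b"0123456789"[start:end]

pairs = asyncio.run(fetch_regional_as_pairs(fake_cat_file, 3, "bkt/obj", 2, 5))
# pairs == [(3, b"234")]
```

With every task yielding `(index, data)` pairs, the caller can gather mixed zonal/regional results and slot each one into place without type sniffing.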
add wrapper for regional tasks to match zonal task return type (idx, data)
/gcbrun
Key Changes

- Implemented `_cat_ranges` in ExtendedGcsFileSystem to group and batch multi-range requests based on bucket type and use one mrd per unique zonal object path.
- Added a `download_ranges` utility in `zb_hns_utils.py` to encapsulate batched download logic and enforce the `MRD_MAX_RANGES` limit in `AsyncMultiRangeDownloader`.
- Updated `_process_limits_to_offset_and_length` and `core._cat_file` to align with standard Python slicing behavior, correctly handling negative indices, zero-length, and invalid ranges without throwing errors.
- Refactored `fetch_range_split` to use the new `download_ranges` utility method for improved readability, and added documentation for the `supports_append` argument in `GCSFile`.
- Added tests for `_cat_ranges`, sync `cat_ranges`, `_fetch_zonal_batch`, `_group_requests_by_bucket_type`, and `download_ranges`, including specific validation for negative batch sizes and handling of mixed Zonal/Regional buckets in `_cat_ranges`.
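The batching described above (one mrd request must carry at most `MRD_MAX_RANGES` ranges, per the 1000-range limit tested earlier in the thread) reduces to splitting the range list into fixed-size chunks. A minimal sketch, assuming the 1000 limit and a simplified `(offset, length)` representation; `chunk_ranges` is an illustrative helper, not the PR's `download_ranges` signature:

```python
MRD_MAX_RANGES = 1000  # assumed limit, per the 1000-ranges check in the PR

def chunk_ranges(ranges, batch_size=MRD_MAX_RANGES):
    """Split a list of (offset, length) ranges into batches of at most
    batch_size, so each multi-range download request stays under the limit.
    Rejects non-positive batch sizes, mirroring the PR's negative-batch-size
    validation."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [ranges[i:i + batch_size] for i in range(0, len(ranges), batch_size)]

batches = chunk_ranges([(i, 1) for i in range(2500)])
# [len(b) for b in batches] -> [1000, 1000, 500]
```

Each resulting batch can then be handed to one mrd request for its zonal object path.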