Skip to content

New chunking approach that never splits encoded chunks#11060

Open
jsignell wants to merge 16 commits intopydata:mainfrom
jsignell:non-splitting-auto
Open

New chunking approach that never splits encoded chunks#11060
jsignell wants to merge 16 commits intopydata:mainfrom
jsignell:non-splitting-auto

Conversation

@jsignell
Copy link
Member

@jsignell jsignell commented Dec 30, 2025

Proposal

A new chunks option that is only allowed to use encoded chunks or multiples of them. No chunk splitting allowed.

Demo

Current behavior when chunks="auto"

ds = xr.open_zarr("gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3", chunks="auto")
image

This PR introduces a new option: chunks="preserve"

ds = xr.open_zarr("gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3", chunks="preserve")
image

Context

I originally set out to update the auto_chunks function in dask, but it felt like my goals were actually quite different. The goal of the dask auto_chunks function is to guarantee that the chunksize will be under a configurable limit while preserving the aspect ratio of previous_chunks (previous_chunks == encoding). This PR instead guarantees that encoded chunks are never split but it will multiply them by some factor to try to get the chunksize close to a targetsize. It doesn't try to preserve the aspect ratio of the chunks. Instead it goes after the dim where there is the greatest number of chunks and it tries to take those in bigger bites.

Also:

  • I'm not quite sure how the interface should work and I am definitely not attached to the word "preserve".
  • I originally put this in the DaskManager, but moved it to the base ChunkManagerEntrypoint class once I realized there was nothing dasky about it. I'm not sure if there is really supposed to be logic in methods on that class though.

@github-actions github-actions bot added topic-backends io topic-NamedArray Lightweight version of Variable labels Dec 30, 2025
* Move ``preserve_chunks`` to base ChunkManager class
* Get target size from dask config options for DaskManager
* Add test for open_zarr
@github-actions github-actions bot added topic-testing topic-hypothesis Strategies or tests using the hypothesis library labels Dec 31, 2025
({"x": "preserve", "y": -1}, (160, 500)),
],
)
def test_open_dataset_chunking_zarr_with_preserve(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are kind of slow.

@jsignell jsignell marked this pull request as ready for review December 31, 2025 18:46
@jsignell jsignell linked an issue Dec 31, 2025 that may be closed by this pull request
@jsignell
Copy link
Member Author

@dcherian is this something you would be able to review? I'd love to get more people trying it.

@github-actions github-actions bot added the topic-zarr Related to zarr storage library label Mar 11, 2026
New Features
~~~~~~~~~~~~

- Adds a new option ``chunks="preserve"`` when opening a dataset. This option
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this should just be "auto". Are we really working around a dask bug?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about doing this work in dask, but I ended up deciding that this is a sufficiently different goal from dask auto. The goal of dask auto is to guarantee that the chunksize will be under a configurable limit while preserving the aspect ratio of previous_chunks. We don't really want either of those things.

But maybe you are just saying: this is what xarray should mean by "auto" in which case I definitely agree. I'm just not sure how to make the transition from the old version of "auto" to the new version. Maybe it would be easier to give it a new name ("preserve") and then change the default value in kwargs from chunks="auto" to chunks="preserve" at some point. If we just change what "auto" means then there is no way for people to get the dask auto behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

io topic-backends topic-hypothesis Strategies or tests using the hypothesis library topic-NamedArray Lightweight version of Variable topic-testing topic-zarr Related to zarr storage library

Projects

None yet

Development

Successfully merging this pull request may close these issues.

opening a zarr dataset taking so much time with dask

2 participants