New chunking approach that never splits encoded chunks by jsignell · Pull Request #11060 · pydata/xarray

jsignell · 2025-12-30T21:40:16Z

Towards "opening a zarr dataset taking so much time with dask" #8902. This doesn't really close it, but it is a better alternative than chunks={} or chunks="auto". Since "Remove special mapping of auto to {} in open_zarr" #11010 got in, xarray could eventually change the default on open_zarr to map to "preserve" rather than to {} if dask is available.
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst
New functions/methods are listed in api.rst

Proposal

A new chunks option that is only allowed to use encoded chunks or multiples of them. No chunk splitting allowed.
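One way to read "no chunk splitting" is that every interior chunk boundary must land on a multiple of the encoded chunk size. The check below is a minimal sketch of that rule, assuming dask-style chunk tuples (one tuple of sizes per dimension). `splits_encoded_chunks` is a hypothetical helper written for illustration; it is not part of xarray or of this PR.

```python
def splits_encoded_chunks(chunks, encoded_chunks):
    """Return True if any interior chunk boundary falls inside an encoded chunk.

    chunks: dask-style chunks, e.g. ((20, 20, 10), (100,))
    encoded_chunks: on-disk chunk size per dimension, e.g. (10, 50)
    """
    for sizes, enc in zip(chunks, encoded_chunks):
        boundary = 0
        # The final boundary is the array edge, so only interior ones matter.
        for size in sizes[:-1]:
            boundary += size
            if boundary % enc != 0:
                return True
    return False


# Multiples of the encoded chunk size (10) never split an encoded chunk.
print(splits_encoded_chunks(((20, 20, 10),), (10,)))
# A chunk size of 15 cuts through the second encoded chunk.
print(splits_encoded_chunks(((15, 15, 20),), (10,)))
```

Under this reading, a single chunk spanning the whole dimension is always allowed, since it has no interior boundaries.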

Demo

Current behavior when chunks="auto"

ds = xr.open_zarr("gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3", chunks="auto")

This PR introduces a new option: chunks="preserve"

ds = xr.open_zarr("gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3", chunks="preserve")

Context

I originally set out to update the auto_chunks function in dask, but it felt like my goals were actually quite different. The goal of the dask auto_chunks function is to guarantee that the chunksize will be under a configurable limit while preserving the aspect ratio of previous_chunks (previous_chunks == encoding). This PR instead guarantees that encoded chunks are never split but it will multiply them by some factor to try to get the chunksize close to a targetsize. It doesn't try to preserve the aspect ratio of the chunks. Instead it goes after the dim where there is the greatest number of chunks and it tries to take those in bigger bites.
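The strategy described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the function name `preserve_chunks_sketch`, its parameters, and the choice to grow by doubling are all made up for this example. It only shows the two stated properties: encoded chunks are never split, and growth goes to the dimension with the most chunks.

```python
import math


def preserve_chunks_sketch(shape, encoded_chunks, itemsize, target_bytes):
    """Grow encoded chunks by integer factors without ever splitting them.

    Hypothetical sketch: doubles the multiple along whichever dimension
    currently has the most chunks, as long as the result stays at or
    under target_bytes.
    """
    factors = [1] * len(shape)

    def chunk_shape(fs):
        # A chunk never extends past the end of the array.
        return tuple(min(f * c, s) for f, c, s in zip(fs, encoded_chunks, shape))

    def nbytes(fs):
        return math.prod(chunk_shape(fs)) * itemsize

    while True:
        n_chunks = [
            math.ceil(s / (f * c)) for s, f, c in zip(shape, factors, encoded_chunks)
        ]
        # Pick the dimension with the greatest number of chunks.
        dim = max(range(len(shape)), key=n_chunks.__getitem__)
        if n_chunks[dim] <= 1:
            break  # every dimension is already a single chunk
        trial = list(factors)
        trial[dim] *= 2
        if nbytes(trial) > target_bytes:
            break  # the next step would overshoot the target
        factors = trial
    return chunk_shape(factors)


# Encoded chunks of (10, 100) float64 values are 8000 bytes each.
print(preserve_chunks_sketch((1000, 100), (10, 100), 8, 64_000))
```

If the encoded chunk is already larger than the target, the sketch returns the encoded chunks unchanged rather than splitting them, which is the key difference from a size-capping approach.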

Also:

- I'm not quite sure how the interface should work and I am definitely not attached to the word "preserve".
- I originally put this in the DaskManager, but moved it to the base ChunkManagerEntrypoint class once I realized there was nothing dasky about it. I'm not sure if there is really supposed to be logic in methods on that class though.

xarray/namedarray/daskmanager.py

* Move ``preserve_chunks`` to base ChunkManager class
* Get target size from dask config options for DaskManager
* Add test for open_zarr

jsignell · 2025-12-31T13:52:09Z

xarray/tests/test_backends.py

+        ({"x": "preserve", "y": -1}, (160, 500)),
+    ],
+)
+def test_open_dataset_chunking_zarr_with_preserve(


These tests are kind of slow.

jsignell · 2026-03-10T21:39:36Z

@dcherian is this something you would be able to review? I'd love to get more people trying it.

xarray/namedarray/parallelcompat.py

dcherian · 2026-03-13T20:57:40Z

doc/whats-new.rst

 New Features
 ~~~~~~~~~~~~

+- Adds a new option ``chunks="preserve"`` when opening a dataset. This option


IMO this should just be "auto". Are we really working around a dask bug?

I thought about doing this work in dask, but I ended up deciding that this is a sufficiently different goal from dask auto. The goal of dask auto is to guarantee that the chunksize will be under a configurable limit while preserving the aspect ratio of previous_chunks. We don't really want either of those things.

But maybe you are just saying: this is what xarray should mean by "auto" in which case I definitely agree. I'm just not sure how to make the transition from the old version of "auto" to the new version. Maybe it would be easier to give it a new name ("preserve") and then change the default value in kwargs from chunks="auto" to chunks="preserve" at some point. If we just change what "auto" means then there is no way for people to get the dask auto behavior.

xarray/namedarray/parallelcompat.py

jsignell added 5 commits December 30, 2025 10:54

Add an auto mechanism that doesn't split encoded chunks

e61da90

Forgot self

75c3f51

Change from 'auto' to 'preserve'

eecb8c5

Make sure api allows chunks='preserve'

4461f87

Add types

c58093f

github-actions bot added topic-backends, io, topic-NamedArray labels Dec 30, 2025

jsignell commented Dec 30, 2025

xarray/namedarray/daskmanager.py (outdated, resolved)

jsignell added 2 commits December 31, 2025 06:40

Refactor and add test

1e85c59

* Move ``preserve_chunks`` to base ChunkManager class
* Get target size from dask config options for DaskManager
* Add test for open_zarr

Add hypothesis testing

fccd263

github-actions bot added topic-testing, topic-hypothesis labels Dec 31, 2025

jsignell commented Dec 31, 2025

jsignell added 3 commits December 31, 2025 09:04

Tidy up strategy

478af0e

Fix up typing

a468c4b

Move preserve_chunks call out of normalize_chunks

20934cd

jsignell marked this pull request as ready for review December 31, 2025 18:46

jsignell linked an issue Dec 31, 2025 that may be closed by this pull request

opening a zarr dataset taking so much time with dask #8902

Open

jsignell mentioned this pull request Feb 4, 2026

opening a zarr dataset taking so much time with dask #8902

Open

Merge branch 'main' into non-splitting-auto

4841a62

jsignell mentioned this pull request Mar 10, 2026

Kerchunk file read failed when change from 2025.9.1 to 2026.2.0 #11220

Closed

5 tasks

Add docs

f70a780

github-actions bot added the topic-zarr label Mar 11, 2026

jsignell commented Mar 13, 2026

xarray/namedarray/parallelcompat.py (outdated, resolved)

Improve docs

e179264

dcherian reviewed Mar 13, 2026

xarray/namedarray/parallelcompat.py (outdated, resolved)

dcherian reviewed Mar 13, 2026

xarray/namedarray/parallelcompat.py (outdated, resolved)

jsignell added 3 commits March 16, 2026 15:05

For non-uniform chunks just pass them back as is

e9fec1c

Use the last dim first to take advantage of c-ordered linearization

a300f07

Merge branch 'main' into non-splitting-auto

e6cef1c

New chunking approach that never splits encoded chunks #11060
jsignell wants to merge 16 commits into pydata:main from jsignell:non-splitting-auto
