Feat: Clean up params for zarr writing by felix0097 · Pull Request #156 · scverse/annbatch

felix0097 · 2026-03-06T14:21:32Z

This PR addresses #151

Proposed changes:

- Merge the dense and sparse chunk and shard params into a single argument each (for sparse arrays, the zarr params are automatically scaled by the number of non-zero (nnz) elements)
- Allow the user to provide zarr_shard_size as a size string in GB/MB

felix0097 · 2026-03-06T14:22:12Z

+    raise ValueError(f"Cannot parse size string: {size!r}. Expected units: {', '.join(SIZE_UNITS)}")
+
+
+def _resolve_shard_obs(shard_size: int | str, elem, iospec: ad.experimental.IOSpec) -> int:


@ilan-gold does this cover all cases for an anndata?


codecov · 2026-03-06T14:23:44Z

Codecov Report

❌ Patch coverage is 84.21053% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.53%. Comparing base (8b541bf) to head (33c797e).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/annbatch/io.py | 84.21% | 6 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #156      +/-   ##
==========================================
- Coverage   93.89%   91.53%   -2.37%     
==========================================
  Files          11       11              
  Lines         852      886      +34     
==========================================
+ Hits          800      811      +11     
- Misses         52       75      +23

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/annbatch/io.py | `92.17% <84.21%> (-0.95%)` ⬇️ |

... and 3 files with indirect coverage changes


ilan-gold · 2026-03-06T14:33:54Z

+    raise ValueError(f"Cannot parse size string: {size!r}. Expected units: {', '.join(SIZE_UNITS)}")
+
+
+def _resolve_shard_obs(shard_size: int | str, elem, iospec: ad.experimental.IOSpec) -> int:


Have a look at how we do things in anndata: https://github.com/scverse/anndata/blob/a0c428379167741833eae806b18e7bda4af2b997/src/anndata/_core/anndata.py#L516-L527. I would probably base this on the actual class, i.e., elem instead of iospec.
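As an illustration of the suggestion above, here is a minimal sketch of sizing an element by its actual class rather than by its iospec. The helper name `bytes_per_obs` is hypothetical and not taken from the PR, and ignoring `indptr` for sparse data is a simplifying assumption.

```python
import numpy as np
import scipy.sparse as sp


def bytes_per_obs(elem) -> float:
    """Hypothetical helper: uncompressed bytes per row, chosen by the element's class."""
    n_obs = elem.shape[0]
    if n_obs == 0:
        raise ValueError("cannot size an element with zero observations")
    if isinstance(elem, np.ndarray):
        # Dense array: total buffer size divided by the number of rows.
        return elem.nbytes / n_obs
    if sp.issparse(elem) and elem.format in {"csr", "csc"}:
        # Sparse array: data and indices both scale with nnz.
        # indptr is ignored here (assumption made for this sketch).
        return (elem.data.nbytes + elem.indices.nbytes) / n_obs
    raise TypeError(f"unsupported element type: {type(elem).__name__}")
```

Branching on the class keeps the check close to the object that is actually written; other element types (dataframes, awkward arrays, and so on) would need their own branches.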

ilan-gold · 2026-03-06T14:36:29Z

            elif iospec.encoding_type in {"csr_matrix", "csc_matrix"}:
+                nnz = elem.nnz
+                avg_nnz = nnz / elem.shape[0] if elem.shape[0] > 0 else 1.0
+                sparse_chunk = max(1, int(chunk_size * avg_nnz))


I think we should error on 0-sized sparse arrays along the first dimension, or maybe make the else branch here just the full size of the array. Is it even possible to have elem.shape[0] == 0 here?

How can chunk_size * avg_nnz ever be 0? If the data is empty?

Yes, if the matrix is empty. This is actually a relatively common use case, e.g. you don't have an X but anndata requires you to have one.
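To make the edge case in this thread concrete, the sketch below restates the three lines from the diff hunk as a standalone function. The function name and signature are hypothetical; the arithmetic follows the hunk.

```python
def sparse_chunk_elems(obs_per_chunk: int, nnz: int, n_obs: int) -> int:
    """Convert a chunk size in observations into a chunk size in stored elements."""
    # Average number of stored elements per row; fall back to 1.0 when there are no rows.
    avg_nnz = nnz / n_obs if n_obs > 0 else 1.0
    # An all-zero matrix has avg_nnz == 0, so the product can be 0; clamp to at least 1.
    return max(1, int(obs_per_chunk * avg_nnz))


print(sparse_chunk_elems(64, 1000, 100))  # 10 stored elements per row on average
print(sparse_chunk_elems(64, 0, 100))     # empty X: rows exist but nothing is stored
```

The `max(1, ...)` guard matters when rows exist but `nnz` is zero, which is the placeholder-`X` situation described above.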

ilan-gold · 2026-03-06T14:37:27Z

    return num - (num % divisor)


+def _parse_size_to_bytes(size: str) -> int:


https://pypi.org/project/humanfriendly/ might be more robust
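For context, a hand-rolled parser along the lines of `_parse_size_to_bytes` might look like the sketch below. The unit table is an assumption (decimal multiples); the PR's actual `SIZE_UNITS` and parsing rules are not shown on this page. The `humanfriendly` package provides `parse_size` for the same job.

```python
import re

# Assumed unit table (decimal multiples); the PR's actual SIZE_UNITS may differ.
SIZE_UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
_SIZE_RE = re.compile(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*")


def parse_size_to_bytes(size: str) -> int:
    """Parse strings such as '1GB' or '512 mb' into a number of bytes."""
    match = _SIZE_RE.fullmatch(size)
    if match is None or match.group(2).upper() not in SIZE_UNITS:
        raise ValueError(
            f"Cannot parse size string: {size!r}. Expected units: {', '.join(SIZE_UNITS)}"
        )
    value, unit = match.groups()
    return int(float(value) * SIZE_UNITS[unit.upper()])
```

A dedicated library also handles details this sketch skips, such as binary units (GiB, MiB).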

ilan-gold · 2026-03-06T14:39:22Z

Also this needs a changelog entry and passing tests


felix0097 · 2026-03-06T15:32:24Z

I've addressed all your comments @ilan-gold

ilan-gold · 2026-03-06T19:00:48Z


 - {class}`annbatch.DatasetCollection` now accepts a `rng` argument to the {meth}`annbatch.DatasetCollection.add_adatas` method.
+- The ``sparse_chunk_size``, ``sparse_shard_size``, ``dense_chunk_size``, and ``dense_shard_size`` parameters of {func}`annbatch.write_sharded` have been replaced by ``chunk_size`` (number of observations per chunk, automatically converted to element counts for sparse arrays) and ``shard_size`` (number of observations per shard or a size string). The corresponding parameters in {meth}`annbatch.DatasetCollection.add_adatas` are ``zarr_chunk_size`` and ``zarr_shard_size``.
+- `zarr_shard_size` in {meth}`annbatch.DatasetCollection.add_adatas` and `shard_size` in {func}`annbatch.write_sharded` now accept a human-readable size string (e.g. ``'1GB'``, ``'512MB'``) in addition to an integer observation count. When a string is provided, the observation count is derived independently for each array element from its uncompressed bytes-per-row so that every shard stays close to the target size.


I would note that the integer does not represent bytes but number of obs. The fact that this must be noted might be a good reason to just go with bytes and not obs (zarr-python does this)
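The distinction between the two units can be sketched as follows: a byte target only becomes an observation count once the uncompressed bytes per row are known. The function name is hypothetical, and rounding down to a multiple of the chunk size is an assumption made here so that a shard holds a whole number of chunks.

```python
def obs_per_shard(target_bytes: int, bytes_per_obs: float, obs_per_chunk: int) -> int:
    """Derive a shard size in observations from a target shard size in bytes."""
    if bytes_per_obs <= 0:
        raise ValueError("bytes_per_obs must be positive")
    # How many rows fit into the byte budget.
    n_obs = int(target_bytes // bytes_per_obs)
    # Round down to a multiple of the chunk size (assumption for this sketch).
    n_obs -= n_obs % obs_per_chunk
    # Never return less than one chunk per shard.
    return max(obs_per_chunk, n_obs)


# 1 GB target, 1000 float32 values per row (4000 bytes), 64 observations per chunk
print(obs_per_shard(10**9, 4000.0, 64))
```

The same integer therefore means very different shard sizes depending on whether it is read as observations or as bytes, which is the ambiguity this comment points out.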


ilan-gold · 2026-03-06T19:06:27Z

-    sparse_shard_size: int = 134_217_728,
-    dense_chunk_size: int = 1024,
-    dense_shard_size: int = 4194304,
+    chunk_size: int = 64,


Suggested change

-    chunk_size: int = 64,
+    obs_per_chunk: int = 64,

and elsewhere! not just the doc string :)


ilan-gold

Let's hold off on merging until we've got all of our breaking changes in a row and then we'll merge and release


Clean up params for zarr writing

fe5ad84

felix0097 requested a review from ilan-gold March 6, 2026 14:21

felix0097 commented Mar 6, 2026


ilan-gold reviewed Mar 6, 2026


felix0097 and others added 4 commits March 6, 2026 16:00

Update src/annbatch/io.py

99bc83f

Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>

Update size calulation + size parsing

d979d5d

Add zarr param changes

62869f0

Merge branch 'ff/refactor-zarr-params' of github.com:laminlabs/arrayl…

9ee7620

…oaders into ff/refactor-zarr-params

felix0097 added the skip-gpu-ci label (Whether gpu ci should be skipped) Mar 6, 2026

felix0097 self-assigned this Mar 6, 2026

felix0097 added 2 commits March 6, 2026 16:22

fix readthedocs errors

cfc99d1

Fix errors

7b1893c

felix0097 requested a review from ilan-gold March 6, 2026 15:32

ilan-gold reviewed Mar 6, 2026


felix0097 and others added 4 commits March 9, 2026 14:13

Update src/annbatch/io.py

fab4eea

Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>

Update src/annbatch/io.py

c8088c6

Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>

Update src/annbatch/io.py

5dcfaf3

Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>

chore: update variable names + changelog

8075339

felix0097 requested a review from ilan-gold March 9, 2026 13:36

ilan-gold reviewed Mar 11, 2026


felix0097 and others added 6 commits March 11, 2026 10:53

Update src/annbatch/io.py

be69b39

Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

c794067

for more information, see https://pre-commit.ci

Update method params

124f2b8

Merge branch 'main' into ff/refactor-zarr-params

62afc65

Merge branch 'ff/refactor-zarr-params' of github.com:laminlabs/arrayl…

e582390

…oaders into ff/refactor-zarr-params

Fix tests

147e4a3

felix0097 requested a review from ilan-gold March 11, 2026 13:16

Rename method params

275aff7

ilan-gold approved these changes Mar 11, 2026


Update CHANGELOG.md

33c797e

Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>

felix0097 merged commit b4b13cc into main Mar 12, 2026
12 checks passed

felix0097 deleted the ff/refactor-zarr-params branch March 12, 2026 14:48
