Skip to content

Feat: Clean up params for zarr writing#156

Merged
felix0097 merged 19 commits intomainfrom
ff/refactor-zarr-params
Mar 12, 2026
Merged

Feat: Clean up params for zarr writing#156
felix0097 merged 19 commits intomainfrom
ff/refactor-zarr-params

Conversation

@felix0097
Copy link
Copy Markdown
Collaborator

This PR addresses #151

Proposes changes:

  • Merge dense + sparse chunk + shard params to single argument (zarr params get automically scaled by nnz elements)
  • Allow user to provide a zarr_shard_size in terms of GB/MB

@felix0097 felix0097 requested a review from ilan-gold March 6, 2026 14:21
Comment thread src/annbatch/io.py Outdated
raise ValueError(f"Cannot parse size string: {size!r}. Expected units: {', '.join(SIZE_UNITS)}")


def _resolve_shard_obs(shard_size: int | str, elem, iospec: ad.experimental.IOSpec) -> int:
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ilan-gold does this cover all cases for an anndata?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have a look how we do things in anndata: https://github.com/scverse/anndata/blob/a0c428379167741833eae806b18e7bda4af2b997/src/anndata/_core/anndata.py#L516-L527 - I would probably make this based on the actual class i.e., elem instead of iospec

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 6, 2026

Codecov Report

❌ Patch coverage is 84.21053% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.53%. Comparing base (8b541bf) to head (33c797e).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/annbatch/io.py 84.21% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #156      +/-   ##
==========================================
- Coverage   93.89%   91.53%   -2.37%     
==========================================
  Files          11       11              
  Lines         852      886      +34     
==========================================
+ Hits          800      811      +11     
- Misses         52       75      +23     
Files with missing lines Coverage Δ
src/annbatch/io.py 92.17% <84.21%> (-0.95%) ⬇️

... and 3 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread src/annbatch/io.py Outdated
raise ValueError(f"Cannot parse size string: {size!r}. Expected units: {', '.join(SIZE_UNITS)}")


def _resolve_shard_obs(shard_size: int | str, elem, iospec: ad.experimental.IOSpec) -> int:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have a look how we do things in anndata: https://github.com/scverse/anndata/blob/a0c428379167741833eae806b18e7bda4af2b997/src/anndata/_core/anndata.py#L516-L527 - I would probably make this based on the actual class i.e., elem instead of iospec

Comment thread src/annbatch/io.py Outdated
Comment thread src/annbatch/io.py Outdated
elif iospec.encoding_type in {"csr_matrix", "csc_matrix"}:
nnz = elem.nnz
avg_nnz = nnz / elem.shape[0] if elem.shape[0] > 0 else 1.0
sparse_chunk = max(1, int(chunk_size * avg_nnz))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should error on 0-sized sparse arrays along the first dimension or may the else branch here just the full size of the array. Is it even possible to have elem.shape[0] here?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can chunk_size * avg_nnz ever be 0? If the data is empty?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, if the matrix is empty. This is actually a relatively common use case, e.g. you don't have an X but anndata requires you to have it

Comment thread src/annbatch/io.py Outdated
return num - (num % divisor)


def _parse_size_to_bytes(size: str) -> int:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ilan-gold
Copy link
Copy Markdown
Collaborator

Also this needs a changelog entry and passing tests

@felix0097 felix0097 added the skip-gpu-ci Whether gpu ci should be skipped label Mar 6, 2026
@felix0097 felix0097 self-assigned this Mar 6, 2026
@felix0097 felix0097 requested a review from ilan-gold March 6, 2026 15:32
@felix0097
Copy link
Copy Markdown
Collaborator Author

I've addressed all your comments @ilan-gold

Comment thread CHANGELOG.md Outdated

- {class}`annbatch.DatasetCollection` now accepts a `rng` argument to the {meth}`annbatch.DatasetCollection.add_adatas` method.
- The ``sparse_chunk_size``, ``sparse_shard_size``, ``dense_chunk_size``, and ``dense_shard_size`` parameters of {func}`annbatch.write_sharded` have been replaced by ``chunk_size`` (number of observations per chunk, automatically converted to element counts for sparse arrays) and ``shard_size`` (number of observations per shard or a size string). The corresponding parameters in {meth}`annbatch.DatasetCollection.add_adatas` are ``zarr_chunk_size`` and ``zarr_shard_size``.
- `zarr_shard_size` in {meth}`annbatch.DatasetCollection.add_adatas` and `shard_size` in {func}`annbatch.write_sharded` now accept a human-readable size string (e.g. ``'1GB'``, ``'512MB'``) in addition to an integer observation count. When a string is provided, the observation count is derived independently for each array element from its uncompressed bytes-per-row so that every shard stays close to the target size.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would note that the integer does not represent bytes but number of obs. The fact that this must be noted might be a good reason to just go with bytes and not obs (zarr-python does this)

Comment thread src/annbatch/io.py Outdated
elif iospec.encoding_type in {"csr_matrix", "csc_matrix"}:
nnz = elem.nnz
avg_nnz = nnz / elem.shape[0] if elem.shape[0] > 0 else 1.0
sparse_chunk = max(1, int(chunk_size * avg_nnz))
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can chunk_size * avg_nnz ever be 0? If the data is empty?

Comment thread src/annbatch/io.py Outdated
Comment thread src/annbatch/io.py Outdated
Comment thread src/annbatch/io.py Outdated
Comment thread src/annbatch/io.py Outdated
sparse_shard_size: int = 134_217_728,
dense_chunk_size: int = 1024,
dense_shard_size: int = 4194304,
chunk_size: int = 64,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
chunk_size: int = 64,
obs_per_chunk: int = 64,

and elsewhere! not just the doc string :)

felix0097 and others added 4 commits March 9, 2026 14:13
Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
@felix0097 felix0097 requested a review from ilan-gold March 9, 2026 13:36
Comment thread src/annbatch/io.py Outdated
Comment thread src/annbatch/io.py Outdated
Comment thread src/annbatch/io.py Outdated
Comment thread src/annbatch/io.py Outdated
@felix0097 felix0097 requested a review from ilan-gold March 11, 2026 13:16
Copy link
Copy Markdown
Collaborator

@ilan-gold ilan-gold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's hold off on merging until we've got all of our breaking changes in a row and then we'll merge and release

Comment thread CHANGELOG.md Outdated
Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
@felix0097 felix0097 merged commit b4b13cc into main Mar 12, 2026
12 checks passed
@felix0097 felix0097 deleted the ff/refactor-zarr-params branch March 12, 2026 14:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

skip-gpu-ci Whether gpu ci should be skipped

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants