Feat: Clean up params for zarr writing#156
Conversation
| raise ValueError(f"Cannot parse size string: {size!r}. Expected units: {', '.join(SIZE_UNITS)}") | ||
|
|
||
|
|
||
| def _resolve_shard_obs(shard_size: int | str, elem, iospec: ad.experimental.IOSpec) -> int: |
There was a problem hiding this comment.
@ilan-gold does this cover all cases for an anndata?
There was a problem hiding this comment.
Have a look how we do things in anndata: https://github.com/scverse/anndata/blob/a0c428379167741833eae806b18e7bda4af2b997/src/anndata/_core/anndata.py#L516-L527 - I would probably make this based on the actual class i.e., elem instead of iospec
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #156 +/- ##
==========================================
- Coverage 93.89% 91.53% -2.37%
==========================================
Files 11 11
Lines 852 886 +34
==========================================
+ Hits 800 811 +11
- Misses 52 75 +23
🚀 New features to boost your workflow:
|
| raise ValueError(f"Cannot parse size string: {size!r}. Expected units: {', '.join(SIZE_UNITS)}") | ||
|
|
||
|
|
||
| def _resolve_shard_obs(shard_size: int | str, elem, iospec: ad.experimental.IOSpec) -> int: |
There was a problem hiding this comment.
Have a look how we do things in anndata: https://github.com/scverse/anndata/blob/a0c428379167741833eae806b18e7bda4af2b997/src/anndata/_core/anndata.py#L516-L527 - I would probably make this based on the actual class i.e., elem instead of iospec
| elif iospec.encoding_type in {"csr_matrix", "csc_matrix"}: | ||
| nnz = elem.nnz | ||
| avg_nnz = nnz / elem.shape[0] if elem.shape[0] > 0 else 1.0 | ||
| sparse_chunk = max(1, int(chunk_size * avg_nnz)) |
There was a problem hiding this comment.
I think we should error on 0-sized sparse arrays along the first dimension or may the else branch here just the full size of the array. Is it even possible to have elem.shape[0] here?
There was a problem hiding this comment.
How can chunk_size * avg_nnz ever be 0? If the data is empty?
There was a problem hiding this comment.
yes, if the matrix is empty. This is actually a relatively common use case, e.g. you don't have an X but anndata requires you to have it
| return num - (num % divisor) | ||
|
|
||
|
|
||
| def _parse_size_to_bytes(size: str) -> int: |
There was a problem hiding this comment.
https://pypi.org/project/humanfriendly/ might be more robust
|
Also this needs a changelog entry and passing tests |
Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
…oaders into ff/refactor-zarr-params
|
I've addressed all your comments @ilan-gold |
|
|
||
| - {class}`annbatch.DatasetCollection` now accepts a `rng` argument to the {meth}`annbatch.DatasetCollection.add_adatas` method. | ||
| - The ``sparse_chunk_size``, ``sparse_shard_size``, ``dense_chunk_size``, and ``dense_shard_size`` parameters of {func}`annbatch.write_sharded` have been replaced by ``chunk_size`` (number of observations per chunk, automatically converted to element counts for sparse arrays) and ``shard_size`` (number of observations per shard or a size string). The corresponding parameters in {meth}`annbatch.DatasetCollection.add_adatas` are ``zarr_chunk_size`` and ``zarr_shard_size``. | ||
| - `zarr_shard_size` in {meth}`annbatch.DatasetCollection.add_adatas` and `shard_size` in {func}`annbatch.write_sharded` now accept a human-readable size string (e.g. ``'1GB'``, ``'512MB'``) in addition to an integer observation count. When a string is provided, the observation count is derived independently for each array element from its uncompressed bytes-per-row so that every shard stays close to the target size. |
There was a problem hiding this comment.
I would note that the integer does not represent bytes but number of obs. The fact that this must be noted might be a good reason to just go with bytes and not obs (zarr-python does this)
| elif iospec.encoding_type in {"csr_matrix", "csc_matrix"}: | ||
| nnz = elem.nnz | ||
| avg_nnz = nnz / elem.shape[0] if elem.shape[0] > 0 else 1.0 | ||
| sparse_chunk = max(1, int(chunk_size * avg_nnz)) |
There was a problem hiding this comment.
How can chunk_size * avg_nnz ever be 0? If the data is empty?
| sparse_shard_size: int = 134_217_728, | ||
| dense_chunk_size: int = 1024, | ||
| dense_shard_size: int = 4194304, | ||
| chunk_size: int = 64, |
There was a problem hiding this comment.
| chunk_size: int = 64, | |
| obs_per_chunk: int = 64, |
and elsewhere! not just the doc string :)
Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
for more information, see https://pre-commit.ci
…oaders into ff/refactor-zarr-params
ilan-gold
left a comment
There was a problem hiding this comment.
Let's hold off on merging until we've got all of our breaking changes in a row and then we'll merge and release
Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
This PR addresses #151
Proposes changes:
zarr_shard_sizein terms of GB/MB