Fix race condition in from_array for arrays with shards#3217
Fix race condition in from_array for arrays with shards#3217bojidar-bg wants to merge 2 commits intozarr-developers:mainfrom
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3217 +/- ##
==========================================
- Coverage 94.76% 94.76% -0.01%
==========================================
Files 78 78
Lines 8642 8648 +6
==========================================
+ Hits 8190 8195 +5
- Misses 452 453 +1
🚀 New features to boost your workflow:
|
| @@ -1265,13 +1287,19 @@ def _iter_chunk_keys( | |||
| yield self.metadata.encode_chunk_key(k) | |||
|
|
|||
| def _iter_chunk_regions( | |||
There was a problem hiding this comment.
I feel like _iter_chunk_regions should only iterate over the regions spanned by each chunk. Otherwise the name doesn't fit. So adding a flag to this function that makes it do something different (iterate over the regions spanned by each shard) seems worse than implementing a new _iter_shard_regions method, that does exactly what its name suggests.
There was a problem hiding this comment.
my general POV is that several well-defined functions is better than a smaller number of functions that try to do a lot. since this is private API, adding functions is cheap, so lets create new functions instead of adding functionality to existing ones in this case
There was a problem hiding this comment.
Hmm.. "private API" might be the magic word there 😂 There are no other users of _iter_chunk_regions (ignoring the a few unit tests); so- may I directly rename the function to _iter_shard_regions in this case? 😁
|
Superseded by #3299. |
Fixes #3169.
When passing data via
create_array(data = ...), the data is inserted byfrom_arrayafter getting split by chunks. In current main, when using shards,from_arrayends up splitting the data by sub-chunks (due toAsyncArray.chunksreturningMetadata.chunkswhich returns the size of the chunks inside shards instead of the size of the "physical" chunk files), which ends up creating a race condition when the writes to multiple sub-chunk end up writing to the same physical chunk file.This PR changes
from_arrayto instead split the data by shard in case there are shards.(Probably, there needs to be a lock somewhere in AsyncArray or ShardCodec which prevents multiple writes to the same shard (or, just physical chunk in general) to execute at the same time)
TODO:
docs/user-guide/*.rstchanges/