Investigate dask for simplified parallel writer.
Consider removing the `N_SHARD` argument: it complicates the API, is used only by `hf_dataset`, and plays no role in the data processing itself, only at write time; and in the current code the number of shards is inferred automatically anyway.
For HF dataset support, the call to a generator must be kept; this is possible with dask:
```python
import dask

def plaid_sample_generator(bag):
    # Materialize one partition at a time and yield its samples.
    for d in bag.to_delayed():
        samples = dask.compute(d)[0]
        for s in samples:
            yield s
```
then
```python
from functools import partial

gen = {"train": partial(plaid_sample_generator, bag)}
plaid.storage.save_to_disk(output_folder="./test", generators=gen)
```
where `bag` is a `dask.bag` object. It is not yet clear how parallelism is actually controlled in this setup.
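One relevant piece of the picture: in dask, parallelism is governed by the scheduler, not by the bag itself (`dask.bag` defaults to the multiprocessing scheduler). The `dask.compute` calls inside the generator can therefore be tuned via `dask.config.set`. A minimal sketch with a toy bag, standing in for the PLAID samples above:

```python
import dask
import dask.bag as db

# Toy bag of 10 items split into 2 partitions; in the note above,
# `bag` would instead hold PLAID samples.
bag = db.from_sequence(range(10), npartitions=2)

# Override the default scheduler for every compute call in this scope;
# num_workers bounds the thread pool size.
with dask.config.set(scheduler="threads", num_workers=4):
    # Compute all partitions; each delayed object yields a list of samples.
    results = dask.compute(*bag.to_delayed())

# Flatten partition results back into a single stream of samples.
flat = [s for part in results for s in part]
print(flat)
```

The same `dask.config.set` context (or a global call without `with`) would wrap the `save_to_disk` call, so the per-partition `dask.compute(d)` inside `plaid_sample_generator` picks up the chosen scheduler. Note the generator consumes partitions sequentially, so parallelism here is limited to within a single partition's task graph.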