[Storage] writer: parallelization evolutions #288

@casenave

Description

Investigate dask for simplified parallel writer.
Consider removing the N_SHARD argument: it complicates the API, is used only by hf_dataset, and only at writing time rather than during data processing; in the current code, the number of shards is inferred automatically anyway.
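As an illustration of why the explicit argument may be redundant (this is a dask sketch, not the current plaid API): with dask.bag, the shard count is simply the bag's partition count, fixed when the bag is created rather than passed separately to the writer.

```python
import dask.bag as db

# Hypothetical example: the partition count plays the role of N_SHARD.
samples = [{"id": i} for i in range(10)]
bag = db.from_sequence(samples, npartitions=4)
print(bag.npartitions)  # → 4
```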

For the HF_dataset support, the call to a generator must be kept; this is possible with dask:

import dask

def plaid_sample_generator(bag):
    # Compute one bag partition at a time and yield its samples.
    for d in bag.to_delayed():
        samples = dask.compute(d)[0]
        for s in samples:
            yield s

then

from functools import partial

gen = {"train": partial(plaid_sample_generator, bag)}
plaid.storage.save_to_disk(output_folder="./test", generators=gen)

where bag is a dask.bag object. I did not understand how parallelism is actually controlled with this approach.
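For reference, a minimal sketch of how dask parallelism is usually controlled (the computation here is illustrative): the partition count sets the task granularity, and the scheduler and worker count are chosen at compute time.

```python
import dask
import dask.bag as db

# The partition count bounds how many tasks can run concurrently.
bag = db.from_sequence(range(8), npartitions=4)

# The scheduler and worker count are selected when compute is called.
doubled = dask.compute(bag.map(lambda x: 2 * x),
                       scheduler="threads", num_workers=4)[0]
print(doubled)  # [0, 2, 4, 6, 8, 10, 12, 14]
```

If I read the generator above correctly, computing one delayed partition at a time processes the partitions sequentially, so any parallelism would be limited to the task graph inside a single partition.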
