Investigate dask for simplified parallel writer.
Consider removing the `N_SHARD` argument: it complicates the API, is used only by `hf_dataset`, and plays no role in the data processing itself, only at write time; and in the current code the number of shards is inferred automatically anyway.
For HF dataset support, the call to a generator must be kept; this is possible with dask:
```python
import dask

def plaid_sample_generator(bag):
    # Materialize one partition at a time and yield its samples.
    for d in bag.to_delayed():
        samples = dask.compute(d)[0]
        for s in samples:
            yield s
```
then
```python
from functools import partial

gen = {"train": partial(plaid_sample_generator, bag)}
plaid.storage.save_to_disk(output_folder="./test", generators=gen)
```
where `bag` is a `dask.bag` object. It is not yet clear how parallelism is actually controlled in this setup.
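One relevant piece of the picture: in dask, parallelism is governed by the scheduler, not by the bag itself (`dask.bag` defaults to the multiprocessing scheduler). The `dask.compute` calls inside the generator can therefore be tuned via `dask.config.set`. A minimal sketch with a toy bag, standing in for the PLAID samples above:

```python
import dask
import dask.bag as db

# Toy bag of 10 items split into 2 partitions; in the note above,
# `bag` would instead hold PLAID samples.
bag = db.from_sequence(range(10), npartitions=2)

# Override the default scheduler for every compute call in this scope;
# num_workers bounds the thread pool size.
with dask.config.set(scheduler="threads", num_workers=4):
    # Compute all partitions; each delayed object yields a list of samples.
    results = dask.compute(*bag.to_delayed())

# Flatten partition results back into a single stream of samples.
flat = [s for part in results for s in part]
print(flat)
```

The same `dask.config.set` context (or a global call without `with`) would wrap the `save_to_disk` call, so the per-partition `dask.compute(d)` inside `plaid_sample_generator` picks up the chosen scheduler. Note the generator consumes partitions sequentially, so parallelism here is limited to within a single partition's task graph.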