Upgrade get_dataset.tokenize() to multiprocessing #24
DrStoop wants to merge 2 commits into huggingface:master
get_dataset.tokenize() is too slow on a single CPU, so this pull request upgrades it to multiprocessing by implementing the multiprocessing target function worker_tokenize(args_list). It also adds a multiprocessing debug logger, mp_logger, together with logger.debug() and mp_logger.debug() messages to track progress in the Python console.
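For readers without the diff at hand, the idea can be sketched roughly as follows. This is a minimal illustration, not the code in this pull request: only the names worker_tokenize(args_list) and mp_logger come from the description above. The helpers toy_tokenize, split_into_chunks and tokenize_parallel are hypothetical, the whitespace tokenizer stands in for the real one, and the sketch assumes the data to tokenize is a top-level list.

```python
import logging
import multiprocessing as mp

logger = logging.getLogger(__name__)
# Assumption: mp_logger is the standard multiprocessing logger. Calling
# mp.log_to_stderr(logging.DEBUG) instead would also print its messages.
mp_logger = mp.get_logger()


def toy_tokenize(text):
    """Hypothetical stand-in for the real tokenizer: split on whitespace."""
    return text.split()


def tokenize(obj):
    """Recursively tokenize every string inside nested dicts and lists."""
    if isinstance(obj, str):
        return toy_tokenize(obj)
    if isinstance(obj, dict):
        return {key: tokenize(value) for key, value in obj.items()}
    return [tokenize(item) for item in obj]


def worker_tokenize(args_list):
    """Pool target: unpack (chunk_index, chunk) and tokenize the chunk."""
    chunk_index, chunk = args_list
    mp_logger.debug("tokenizing chunk %d (%d items)", chunk_index, len(chunk))
    return tokenize(chunk)


def split_into_chunks(items, num_chunks):
    """Split a list into at most num_chunks contiguous, near-equal parts."""
    num_chunks = max(1, min(num_chunks, len(items)))
    size, remainder = divmod(len(items), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        end = start + size + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def tokenize_parallel(items, num_workers=None):
    """Tokenize a list of items, one chunk per worker process."""
    num_workers = num_workers or mp.cpu_count()
    args = list(enumerate(split_into_chunks(items, num_workers)))
    logger.debug("tokenizing %d items in %d chunks", len(items), len(args))
    if len(args) == 1:
        # A single chunk gains nothing from a pool, so stay in-process.
        results = [worker_tokenize(args[0])]
    else:
        with mp.Pool(processes=len(args)) as pool:
            # Pool.map returns results in the order of its inputs.
            results = pool.map(worker_tokenize, args)
    tokenized = []
    for chunk in results:
        tokenized.extend(chunk)
    return tokenized
```

A call such as tokenize_parallel(dialogs, num_workers=4) should sit under an `if __name__ == "__main__":` guard, since platforms that spawn worker processes re-import the main module. Because each chunk is pickled on its way to and from the workers, the speedup depends on tokenization cost outweighing that overhead.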
Looks nice, thanks!
DrStoop left a comment
Thanks for reviewing, very nice project, happy you published it :) If there's anything else, let me know...
tokenize.dict_key_calls = 0

dataset = tokenize(dataset)
# dataset = tokenize(dataset)
absolutely!
# dataset = tokenize(dataset)
personachat = tokenize(personachat)
torch.save(personachat, dataset_cache)
# torch.save(personachat, dataset_cache)
of course!
# torch.save(personachat, dataset_cache)
torch.save(personachat, dataset_cache)
The question would be, if
@thomwolf, could we please get this merged? Thank you.
@thomwolf, before merging: I did some work on parallelizing the complete preprocessing chain, which affects quite a lot of code in 'train.py' and 'utils.py'. I could clean up the code and create a new pull request with, e.g., two new files, 'utils_multiprocessing.py' and 'train_multiprocessing.py'. That way merging would be very easy and backward compatibility would be guaranteed for everybody. Just let me know if you are interested in merging such a speedup ⏩ 💨