
Upgrade get_dataset.tokenize() to multiprocessing #24

Open
DrStoop wants to merge 2 commits into huggingface:master from DrStoop:integration

Conversation


@DrStoop DrStoop commented Aug 20, 2019

get_dataset.tokenize() on a single CPU is very slow. Therefore, this pull request upgrades it to multiprocessing by implementing the multiprocessing target function worker_tokenize(args_list). Additionally, a multiprocessing debug logger mp_logger was added, together with logger.debug() and mp_logger.debug() messages, to track progress in the Python console.
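A minimal sketch of the approach described above: worker_tokenize and the mp_logger debug messages come from the PR description, while the tokenize_parallel helper, the chunking scheme, and the picklable tokenizer argument are assumptions for illustration, not the PR's actual code.

```python
# Sketch of multiprocess tokenization, assuming a picklable tokenizer callable.
from multiprocessing import Pool, cpu_count, get_logger

def worker_tokenize(args_list):
    """Tokenize one chunk of the dataset in a worker process."""
    tokenizer, chunk = args_list
    # multiprocessing's own logger; messages appear when its level is DEBUG.
    get_logger().debug("tokenizing a chunk of %d items", len(chunk))
    return [tokenizer(text) for text in chunk]

def tokenize_parallel(tokenizer, texts, n_workers=None):
    """Split texts across workers, tokenize in parallel, re-join the results."""
    n_workers = n_workers or cpu_count()
    size = max(1, len(texts) // n_workers)
    chunks = [texts[i:i + size] for i in range(0, len(texts), size)]
    with Pool(n_workers) as pool:
        per_chunk = pool.map(worker_tokenize, [(tokenizer, c) for c in chunks])
    # Flatten the per-chunk results back into one flat list.
    return [tokens for chunk in per_chunk for tokens in chunk]
```

Everything passed to the workers (the tokenizer and each chunk) is pickled through the task queue, so the tokenizer must be a module-level or built-in callable, not a lambda.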

Comment thread utils.py
@thomwolf
Member

Looks nice, thanks!

Author

@DrStoop DrStoop left a comment


Thanks for reviewing, very nice project, happy you published it :) If there's anything else, let me know...

Comment thread utils.py
tokenize.dict_key_calls = 0

dataset = tokenize(dataset)
# dataset = tokenize(dataset)
Author


absolutely!

Suggested change
# dataset = tokenize(dataset)

Comment thread utils.py

personachat = tokenize(personachat)
torch.save(personachat, dataset_cache)
# torch.save(personachat, dataset_cache)
Author


of course!

Suggested change
# torch.save(personachat, dataset_cache)
torch.save(personachat, dataset_cache)

@DrStoop
Author

DrStoop commented Aug 20, 2019

The question would be whether the multiprocessing module should be added to requirements.txt.
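As a side note on that question: multiprocessing ships with the Python standard library, so it would not need a requirements.txt entry. A quick check shows that the module resolves from the standard library tree rather than from site-packages:

```python
import multiprocessing

# A stdlib module lives under lib/pythonX.Y/, not under site-packages,
# so no pip dependency is involved.
print(multiprocessing.__file__)
```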

@martinritchie

@thomwolf , please could we get this merged? Thank you.

@DrStoop
Author

DrStoop commented Sep 18, 2019

@thomwolf, before merging: I did some work on parallelizing the complete preprocessing chain, which affects quite a bit of code in 'train.py' and 'utils.py'. I could clean up the code and create a new pull request with e.g. two new files, 'utils_multiprocessing.py' and 'train_multiprocessing.py'. This would make merging very easy and guarantee backward compatibility for everybody. Just let me know if you are interested in merging such a speedup ⏩ 💨
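A parallelized preprocessing chain of this kind might look roughly like the sketch below. The per-chunk step functions here (tokenize_chunk, build_inputs_chunk) are placeholders for illustration only, not the actual code of the proposed 'utils_multiprocessing.py':

```python
from multiprocessing import Pool, cpu_count

def tokenize_chunk(chunk):
    # Placeholder tokenization step: lowercase and whitespace-split.
    return [text.lower().split() for text in chunk]

def build_inputs_chunk(tokens):
    # Placeholder input-building step: fake per-token ids.
    return [{"input_ids": list(range(len(t))), "words": t} for t in tokens]

def preprocess_chunk(chunk):
    # Run the whole chain inside one worker, so data crosses process
    # boundaries only twice: once on dispatch, once on collection.
    return build_inputs_chunk(tokenize_chunk(chunk))

def preprocess_parallel(texts, n_workers=None):
    """Run the full preprocessing chain on chunks of texts in parallel."""
    n_workers = n_workers or cpu_count()
    size = max(1, len(texts) // n_workers)
    chunks = [texts[i:i + size] for i in range(0, len(texts), size)]
    with Pool(n_workers) as pool:
        results = pool.map(preprocess_chunk, chunks)
    return [item for chunk in results for item in chunk]
```

Keeping the chain inside a single worker function, rather than running one pool per step, avoids repeatedly serializing intermediate results between processes.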

3 participants