
Upgrade get_dataset.tokenize() to multiprocessing #24

Open
DrStoop wants to merge 2 commits into huggingface:master from DrStoop:integration

Conversation


@DrStoop DrStoop commented Aug 20, 2019

get_dataset.tokenize() on a single CPU is very slow. Therefore, this pull request upgrades it to multiprocessing by implementing the multiprocessing target function worker_tokenize(args_list). Additionally, a multiprocessing debug logger mp_logger was added, together with logger.debug() and mp_logger.debug() messages, to track progress in the Python console.
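A minimal sketch of the approach described above: worker_tokenize and the mp_logger debug messages come from the PR description, while the tokenize_parallel helper, the chunking scheme, and the picklable tokenizer argument are assumptions for illustration, not the PR's actual code.

```python
# Sketch of multiprocess tokenization, assuming a picklable tokenizer callable.
from multiprocessing import Pool, cpu_count, get_logger

def worker_tokenize(args_list):
    """Tokenize one chunk of the dataset in a worker process."""
    tokenizer, chunk = args_list
    # multiprocessing's own logger; messages appear when its level is DEBUG.
    get_logger().debug("tokenizing a chunk of %d items", len(chunk))
    return [tokenizer(text) for text in chunk]

def tokenize_parallel(tokenizer, texts, n_workers=None):
    """Split texts across workers, tokenize in parallel, re-join the results."""
    n_workers = n_workers or cpu_count()
    size = max(1, len(texts) // n_workers)
    chunks = [texts[i:i + size] for i in range(0, len(texts), size)]
    with Pool(n_workers) as pool:
        per_chunk = pool.map(worker_tokenize, [(tokenizer, c) for c in chunks])
    # Flatten the per-chunk results back into one flat list.
    return [tokens for chunk in per_chunk for tokens in chunk]
```

Everything passed to the workers (the tokenizer and each chunk) is pickled through the task queue, so the tokenizer must be a module-level or built-in callable, not a lambda.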

Comment thread utils.py
@thomwolf
Member

Looks nice, thanks!

Author

@DrStoop DrStoop left a comment


Thanks for reviewing, very nice project, happy you published it :) If there's anything else, let me know...

Comment thread utils.py
tokenize.dict_key_calls = 0

dataset = tokenize(dataset)
# dataset = tokenize(dataset)
Author


absolutely!

Suggested change
# dataset = tokenize(dataset)

Comment thread utils.py

personachat = tokenize(personachat)
torch.save(personachat, dataset_cache)
# torch.save(personachat, dataset_cache)
Author


of course!

Suggested change
# torch.save(personachat, dataset_cache)
torch.save(personachat, dataset_cache)

@DrStoop
Author

DrStoop commented Aug 20, 2019

The question would be whether the multiprocessing module should be added to requirements.txt.
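As a side note on that question: multiprocessing ships with the Python standard library, so it would not need a requirements.txt entry. A quick check shows that the module resolves from the standard library tree rather than from site-packages:

```python
import multiprocessing

# A stdlib module lives under lib/pythonX.Y/, not under site-packages,
# so no pip dependency is involved.
print(multiprocessing.__file__)
```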

@martinritchie

@thomwolf , please could we get this merged? Thank you.

@DrStoop
Author

DrStoop commented Sep 18, 2019

@thomwolf, before merging: I did some work on parallelizing the complete preprocessing chain, which affects quite a bit of code in 'train.py' and 'utils.py'. I could clean up the code and create a new pull request with e.g. two new files, 'utils_multiprocessing.py' and 'train_multiprocessing.py'. This would make merging very easy and guarantee backward compatibility for everybody. Just let me know if you are interested in merging such a speedup ⏩ 💨
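A parallelized preprocessing chain of this kind might look roughly like the sketch below. The per-chunk step functions here (tokenize_chunk, build_inputs_chunk) are placeholders for illustration only, not the actual code of the proposed 'utils_multiprocessing.py':

```python
from multiprocessing import Pool, cpu_count

def tokenize_chunk(chunk):
    # Placeholder tokenization step: lowercase and whitespace-split.
    return [text.lower().split() for text in chunk]

def build_inputs_chunk(tokens):
    # Placeholder input-building step: fake per-token ids.
    return [{"input_ids": list(range(len(t))), "words": t} for t in tokens]

def preprocess_chunk(chunk):
    # Run the whole chain inside one worker, so data crosses process
    # boundaries only twice: once on dispatch, once on collection.
    return build_inputs_chunk(tokenize_chunk(chunk))

def preprocess_parallel(texts, n_workers=None):
    """Run the full preprocessing chain on chunks of texts in parallel."""
    n_workers = n_workers or cpu_count()
    size = max(1, len(texts) // n_workers)
    chunks = [texts[i:i + size] for i in range(0, len(texts), size)]
    with Pool(n_workers) as pool:
        results = pool.map(preprocess_chunk, chunks)
    return [item for chunk in results for item in chunk]
```

Keeping the chain inside a single worker function, rather than running one pool per step, avoids repeatedly serializing intermediate results between processes.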

3 participants