Skip to content

Add multiprocessing support for BM25#2146

Merged
menshikh-iv merged 15 commits into
piskvorky:developfrom
Shiki-H:bm25-multiprocessing
Aug 13, 2018
Merged

Add multiprocessing support for BM25#2146
menshikh-iv merged 15 commits into
piskvorky:developfrom
Shiki-H:bm25-multiprocessing

Conversation

@Shiki-H
Copy link
Copy Markdown
Contributor

@Shiki-H Shiki-H commented Aug 6, 2018

I realized that computing BM25 for large corpus can be quite time consuming, so I added multiprocessing support for the original BM25. I did not open an issue as this is a very quick fix and thought that I would probably make a PR directly. I also added a test to verify the result of using multiprocessing is identical to the original approach. Probably not many people use this function, but hopefully it helps :)

Comment thread gensim/summarization/bm25.py Outdated

def get_bm25_weights(corpus):
def _get_scores(bm25, document, average_idf):
"""helper function for retrieving bm25 scores in parallel"""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use numpy-style docstrings (here and everywhere)

Comment thread gensim/summarization/bm25.py Outdated
elif n_jobs is None:
return 1
elif n_jobs < 0:
n_jobs = max(cpu_count() + 1 + n_jobs, 1)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this for n_jobs < 0 case, probably should be cpu_count() - 1, wdyt?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@menshikh-iv I was trying to stick to sklearn style here, where n_jobs=-1 means using all cores. In fact, this way of determining the number of effective jobs was borrowed from sklearn which can be found here. Please let me know if you think this is ok.

Copy link
Copy Markdown
Contributor

@menshikh-iv menshikh-iv Aug 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, that's fine (through cpu_count implementation different, but I don't worry about it)

get_score = partial(_get_scores, bm25, average_idf=average_idf)
pool = Pool(n_processes)
weights = pool.map(get_score, corpus)
pool.close()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strange order, you close and join after, why?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@menshikh-iv I came across this SO question a while ago and learned that one actually need to call close before using join. This can also be found in python's official docs.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

Comment thread gensim/test/test_BM25.py Outdated
weights1 = get_bm25_weights(common_texts)
weights2 = get_bm25_weights(common_texts, n_jobs=2)
weights3 = get_bm25_weights(common_texts, n_jobs=-1)
self.assertEqual(weights1, weights2)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be assertAlmostEqual instead (never compare floating point values using "strict" equal)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@menshikh-iv That was my bad. Fixed now.

Copy link
Copy Markdown
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good @Shiki-H, please fix last review comments and I'll merge this PR

Comment thread gensim/summarization/bm25.py Outdated
return scores


def _effective_n_jobs(n_jobs):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this function to gensim.utils and rename it to effective_n_jobs (looks useful for later usage).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@menshikh-iv thanks. I have fixed them.

Comment thread gensim/summarization/bm25.py Outdated
Returns
-------
int
number of effective jobs
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n -> N + . at the end of sentence

@menshikh-iv
Copy link
Copy Markdown
Contributor

Thanks @Shiki-H, congratz with first contribution 🥇

@menshikh-iv menshikh-iv changed the title BM25 with multiprocessing support Add multiprocessing support for BM25 Aug 13, 2018
@menshikh-iv menshikh-iv merged commit 466b32f into piskvorky:develop Aug 13, 2018
@Shiki-H
Copy link
Copy Markdown
Contributor Author

Shiki-H commented Aug 14, 2018

@menshikh-iv Thanks :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants