Add multiprocessing support for BM25#2146
Conversation
|
|
||
| def get_bm25_weights(corpus): | ||
| def _get_scores(bm25, document, average_idf): | ||
| """helper function for retrieving bm25 scores in parallel""" |
There was a problem hiding this comment.
Please use numpy-style docstrings (here and everywhere)
| elif n_jobs is None: | ||
| return 1 | ||
| elif n_jobs < 0: | ||
| n_jobs = max(cpu_count() + 1 + n_jobs, 1) |
There was a problem hiding this comment.
if this for n_jobs < 0 case, probably should be cpu_count() - 1, wdyt?
There was a problem hiding this comment.
@menshikh-iv I was trying to stick to sklearn style here, where n_jobs=-1 means using all cores. In fact, this way of determining the number of effective jobs was borrowed from sklearn which can be found here. Please let me know if you think this is ok.
There was a problem hiding this comment.
ok, that's fine (through cpu_count implementation different, but I don't worry about it)
| get_score = partial(_get_scores, bm25, average_idf=average_idf) | ||
| pool = Pool(n_processes) | ||
| weights = pool.map(get_score, corpus) | ||
| pool.close() |
There was a problem hiding this comment.
strange order, you close and join after, why?
There was a problem hiding this comment.
@menshikh-iv I came across this SO question a while ago and learned that one actually need to call close before using join. This can also be found in python's official docs.
| weights1 = get_bm25_weights(common_texts) | ||
| weights2 = get_bm25_weights(common_texts, n_jobs=2) | ||
| weights3 = get_bm25_weights(common_texts, n_jobs=-1) | ||
| self.assertEqual(weights1, weights2) |
There was a problem hiding this comment.
should be assertAlmostEqual instead (never compare floating point values using "strict" equal)
There was a problem hiding this comment.
@menshikh-iv That was my bad. Fixed now.
menshikh-iv
left a comment
There was a problem hiding this comment.
looks good @Shiki-H, please fix last review comments and I'll merge this PR
| return scores | ||
|
|
||
|
|
||
| def _effective_n_jobs(n_jobs): |
There was a problem hiding this comment.
Can you move this function to gensim.utils and rename it to effective_n_jobs (looks useful for later usage).
| Returns | ||
| ------- | ||
| int | ||
| number of effective jobs |
There was a problem hiding this comment.
n -> N + . at the end of sentence
|
Thanks @Shiki-H, congratz with first contribution 🥇 |
BM25
|
@menshikh-iv Thanks :) |
I realized that computing BM25 for large corpus can be quite time consuming, so I added multiprocessing support for the original BM25. I did not open an issue as this is a very quick fix and thought that I would probably make a PR directly. I also added a test to verify the result of using multiprocessing is identical to the original approach. Probably not many people use this function, but hopefully it helps :)