Fix min_count handling in phrases detection using npmi#2072
Conversation
|
This seems to be also the issue here |
|
Thanks @lopusz ! Sorry it's taking longer to process, we were busy with the documentation updates. This indeed looks like a bug to me. @michaelwsherman thoughts? |
| else: | ||
| # Return the value below minimal npmi, to make sure that phrases | ||
| # will be created only out of bigrams more frequent than min_count | ||
| return -1.1 |
There was a problem hiding this comment.
What is the reasoning behind -1.1? Looks too magical.
Deserves a code comment at the very least, but maybe a principled value would be even better? -inf? None?
There was a problem hiding this comment.
That value will be compared against the threshold parameter that was supplied by the user, which even though is not restricted, should fall within -1 and 1.
Also. I would use a >= instead of > in if bigram_count > min_count:
There was a problem hiding this comment.
Good catch.
I'd be in favor of returning None (for all scoring functions) rather than a number for any min_count violation since that would mean all scorers return the same value that means "min_count not met". But I also don't know if None may break things. What tests fail with None? From a quick look this is probably safe.
I don't think returning -1.1 is a great idea either, but I agree about it being commented at the very least.
There was a problem hiding this comment.
It seems you can't compare int/float with NoneType (at least in python 3.6.2), so line 168 of phrases.py would have to be something like if score is not None and score > threshold:. This would also require to change the default scorer to return None explicitly when min_count is not met, for consistency across scorers.
Other solution, instead of returning None in scorers, could be return -float('Inf') which would result in 'False' when comparing score > threshold.
I personally like solution 1 more.
There was a problem hiding this comment.
I am sorry for a slow reply. The pull-request woke up in a pretty busy time for me.
Thank you for remarks. Indeed -inf sounds a bit less fragile-magical. Code updated.
Personally, I would avoid returning None from function expected to return a number and cluttering the rest of the code.
There was a problem hiding this comment.
+1 to -Inf, I learned something :). LGTM.
|
Code updated along the review lines. @piskvorky, Others do you find it mergeable? |
|
@lopusz thanks! Let's wait for @menshikh-iv 's verdict. |
|
@menshikh-iv what do you think? |
| pb = wordb_count / corpus_word_count | ||
| pab = bigram_count / corpus_word_count | ||
| return log(pab / (pa * pb)) / -log(pab) | ||
| if bigram_count > min_count: |
There was a problem hiding this comment.
The sharp inequality goes against the established pattern in the other classes in this module, and in word2vec etc. Should be >= IMO.
|
@piskvorky Code updated. However, I want to bikeshed a bit on that (If the stuff below looks like nonsense, just ignore it...) I arrived at > sign because IMO gensim api for the default scorer seems to indirectly enforces > min_count in the default scorer. It would be great if default and npmi were compatible. How is that enforced, you say? Two things:
Since, I am not allowed to Perhaps the check could enforce only that How I arrived at this problem and why it might be important? I wanted to generate all the phrases above certain What do you think about allowing If I am the only bothered about that, just ignore. |
|
No problem :) Good points. As a user, without reading the docs, I'd expect The principle of least surprise. If I'm in favour of b). But let's be consistent in the naming and the semantics, across all supported scorers. |
|
@lopusz I'm +1 for #2072 (comment), i.e. should be Current code looks good, but please resolve merge conflict. |
|
Great. Corrected doc string and comment to reflect the |
|
@menshikh-iv Do you think it is now mergeable? |
|
@lopusz looks good, thanks! Congratz with first contribution 🥇 |
|
@menshikh-iv :) Yes, my mergeytocin level has never been so high :) I hope to be contributing more in the future. |
|
@lopusz 🌟 great, you are welcome :) |
Dear Gensim Developers,
First of all, gensim is an amazing tool. Thank you for your work!
Recently, I played a bit with phrases (collocation) detection via gensim.
It seems that the npmi version ignores
min_countparameter.I believe
min_countshould allow me to filter rarely occurring bigrams no matter how strong the collocation score might be. It works fine with the default (original) scorer, but npmi ignoresmin_countparameter.The attached example illustrates the point (sorry for txt ext, but github complains about py). It displays detected phrases, calls are with
min_count=3. Before modification the NPMI version does not drop cases with 3 or fewer occurrences.After the correction, the low count bigrams are correctly pruned:
If my way of reasoning is right, but you want me to add some love to the attached pull
request do not hesitate to let me know (perhaps stating explicitly in the docs that
scorer should handle the
min_countwould be a nice addition, in case somebody wants to write her/his own...)Best regards,
Michał
mwe.txt