Fix min_count handling in phrases detection using npmi by lopusz · Pull Request #2072 · piskvorky/gensim

lopusz · 2018-05-29T14:53:20Z

Dear Gensim Developers,

First of all, gensim is an amazing tool. Thank you for your work!

Recently, I played a bit with phrase (collocation) detection in gensim.

It seems that the NPMI scorer ignores the min_count parameter.
I believe min_count should let me filter out rarely occurring bigrams no matter how strong the collocation score might be. It works fine with the default (original) scorer, but the NPMI scorer ignores min_count.

The attached example illustrates the point (sorry for the txt extension, but GitHub complains about py). It displays the detected phrases; all calls use min_count=3. Before the modification, the NPMI version does not drop bigrams with 3 or fewer occurrences.

Default scoring
(b'A', b'A') (5, 0.18)
(b'B', b'A') (4, 0.1125)
NPMI scoring
(b'E', b'B') (1, 0.29377325990008535)
(b'A', b'E') (1, 0.2179885169004593)
(b'B', b'A') (4, -0.032919469600875044)
(b'A', b'A') (5, -0.03842191266041472)
(b'B', b'B') (2, -0.23155457855616865)
(b'A', b'B') (1, -0.48823822319945537)

After the correction, the low count bigrams are correctly pruned:

Default scoring
(b'A', b'A') (5, 0.18)
(b'B', b'A') (4, 0.1125)
NPMI scoring
(b'B', b'A') (4, -0.032919469600875044)
(b'A', b'A') (5, -0.03842191266041472)
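
The pruning above can be reproduced from the listed numbers alone. The following is a minimal sketch, not the code from mwe.txt: it copies the NPMI rows from the "before" output and keeps only bigrams whose count exceeds min_count=3. The helper name prune_by_min_count is hypothetical and not part of gensim.

```python
# NPMI rows copied from the "before" output: (bigram, count, score).
before = [
    ((b'E', b'B'), 1, 0.29377325990008535),
    ((b'A', b'E'), 1, 0.2179885169004593),
    ((b'B', b'A'), 4, -0.032919469600875044),
    ((b'A', b'A'), 5, -0.03842191266041472),
    ((b'B', b'B'), 2, -0.23155457855616865),
    ((b'A', b'B'), 1, -0.48823822319945537),
]


def prune_by_min_count(rows, min_count):
    """Hypothetical helper: keep rows whose count is strictly greater than min_count."""
    return [row for row in rows if row[1] > min_count]


after = prune_by_min_count(before, min_count=3)
for bigram, count, score in after:
    print(bigram, (count, score))
```

No bigram in this sample occurs exactly 3 times, so a strict (>) and a non-strict (>=) comparison keep the same two rows here.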

If my reasoning is right but you want me to add some love to the attached pull request, do not hesitate to let me know (perhaps stating explicitly in the docs that the scorer should handle min_count would be a nice addition, in case somebody wants to write her/his own...)

Best regards,
Michał
mwe.txt

lopusz · 2018-07-13T15:11:25Z

This also seems to be the issue in #2086.

piskvorky · 2018-07-13T15:15:46Z

Thanks @lopusz ! Sorry it's taking longer to process, we were busy with the documentation updates.

This indeed looks like a bug to me. @michaelwsherman thoughts?

piskvorky · 2018-07-13T15:16:58Z

+    else:
+        # Return the value below minimal npmi, to make sure that phrases
+        # will be created only out of bigrams more frequent than min_count
+        return -1.1


What is the reasoning behind -1.1? Looks too magical.

Deserves a code comment at the very least, but maybe a principled value would be even better? -inf? None?

That value will be compared against the threshold parameter supplied by the user, which, although not restricted, should fall between -1 and 1.

Also, I would use >= instead of > in if bigram_count > min_count:

Good catch.

I'd be in favor of returning None (for all scoring functions) rather than a number for any min_count violation since that would mean all scorers return the same value that means "min_count not met". But I also don't know if None may break things. What tests fail with None? From a quick look this is probably safe.

I don't think returning -1.1 is a great idea either, but I agree about it being commented at the very least.

It seems you can't compare int/float with NoneType (at least in Python 3.6.2), so line 168 of phrases.py would have to be something like if score is not None and score > threshold:. This would also require changing the default scorer to return None explicitly when min_count is not met, for consistency across scorers.

Another solution, instead of returning None in scorers, would be to return -float('Inf'), which would result in False when comparing score > threshold.

I personally like solution 1 more.
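
The two options discussed here can be compared in isolation. The sketch below is only an illustration of the Python 3 comparison semantics involved; the threshold value is made up and nothing in it is taken from phrases.py.

```python
threshold = 0.5  # hypothetical user-supplied threshold

# Option 1: the scorer returns None, so the caller needs an explicit guard.
score = None
try:
    score > threshold
    none_comparable = True
except TypeError:
    # Python 3 refuses to order None against a float.
    none_comparable = False

passes_with_guard = score is not None and score > threshold

# Option 2: the scorer returns -inf, which is never greater than any threshold.
score = -float('inf')
passes_with_inf = score > threshold
passes_with_low_threshold = score > -1.0  # even the lowest sensible NPMI threshold

print(none_comparable, passes_with_guard, passes_with_inf, passes_with_low_threshold)
```

With -inf the existing score > threshold comparison keeps working unchanged, while None needs the extra guard at every call site.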

I am sorry for a slow reply. The pull-request woke up in a pretty busy time for me.

Thank you for the remarks. Indeed -inf sounds a bit less fragile-magical. Code updated.

Personally, I would avoid returning None from a function expected to return a number, since that clutters the rest of the code.

+1 to -Inf, I learned something :). LGTM.

lopusz · 2018-07-17T20:45:44Z

Code updated along the review lines. @piskvorky and others, do you find it mergeable?

piskvorky · 2018-07-18T08:21:00Z

@lopusz thanks! Let's wait for @menshikh-iv 's verdict.

lopusz · 2018-07-21T15:34:34Z

@menshikh-iv what do you think?

piskvorky · 2018-07-21T16:26:48Z

-    pb = wordb_count / corpus_word_count
-    pab = bigram_count / corpus_word_count
-    return log(pab / (pa * pb)) / -log(pab)
+    if bigram_count > min_count:


The sharp inequality goes against the established pattern in the other classes in this module, and in word2vec etc. Should be >= IMO.
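
For reference, a sketch of how the scorer could look with the non-strict comparison and the -inf return value discussed in this review. This is an assumption-laden illustration, not the merged code: the function name and signature are hypothetical, and the pa line does not appear in the hunk above and is assumed by analogy with pb.

```python
from math import log


def npmi_scorer_sketch(worda_count, wordb_count, bigram_count, min_count, corpus_word_count):
    """Hypothetical NPMI scorer: -inf for bigrams rarer than min_count."""
    if bigram_count >= min_count:
        pa = worda_count / corpus_word_count  # assumed, by analogy with pb
        pb = wordb_count / corpus_word_count
        pab = bigram_count / corpus_word_count
        return log(pab / (pa * pb)) / -log(pab)
    # Below min_count: return a value that no threshold can be smaller than.
    return float('-inf')


# A bigram seen exactly min_count times is scored; one seen less often is not.
print(npmi_scorer_sketch(10, 10, 3, 3, 100))
print(npmi_scorer_sketch(10, 10, 2, 3, 100))
```

With >=, a bigram that occurs exactly min_count times still receives a regular NPMI score, which matches the "minimum count to pass the filter" reading of the parameter.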


lopusz · 2018-07-23T16:40:33Z

@piskvorky Code updated.

However, I want to bikeshed a bit on that (if the stuff below looks like nonsense, just ignore it...)

I arrived at the > sign because, IMO, the gensim API indirectly enforces > min_count in the default scorer. It would be great if default and npmi were compatible.

How is that enforced, you say?

Two things:

the score formula (bigram_count - min_count)/SOMETHING
Runtime check in the Phrases constructor enforcing threshold>0 (note sharp inequality)

    if threshold <= 0 and scoring == 'default':
        raise ValueError("threshold should be positive for default scoring")

Since I am not allowed to set threshold=0, the default scorer will always give me
bigrams with count > min_count.
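
To make the argument concrete, here is a minimal sketch. It keeps only the numerator stated above; the denominator argument stands in for SOMETHING and is assumed to be positive, and the function name and the numbers are hypothetical.

```python
def default_score_sketch(bigram_count, min_count, denominator):
    """Numerator as in the formula above; `denominator` stands in for SOMETHING (> 0)."""
    return (bigram_count - min_count) / denominator


min_count = 3
threshold = 1e-9  # any strictly positive threshold, as the constructor requires

# Which counts produce a score above a positive threshold?
passed = [
    count
    for count in (2, 3, 4)
    if default_score_sketch(count, min_count, denominator=10.0) > threshold
]
print(passed)
```

A bigram with count equal to min_count gets a score of exactly 0, so under a strictly positive threshold only counts above min_count can pass.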

Perhaps the check could enforce only that threshold>=0 for the default scoring?

How did I arrive at this problem, and why might it be important?

I wanted to generate all the phrases above a certain min_count and compare
the scores (ranking) provided by both methods for the same bigrams.
It seems reasonable to have the option to score all
bigrams fulfilling the min_count requirement. And indeed, min_count should be
compatible with the rest of the library.

What do you think about allowing threshold>=0 for the default scoring?
@piskvorky @menshikh-iv @michaelwsherman @rafabr4

If I am the only one bothered by this, just ignore it.
Or maybe there are some negative side effects of this additional change I cannot see?

piskvorky · 2018-07-23T19:02:12Z

No problem :) Good points.

As a user, without reading the docs, I'd expect min_count to mean "minimum count to pass the filter", not "maximum count that still doesn't pass the filter".

The principle of least surprise.

If min_count means something else, in this PR or in the old code, we should either
a) change the variable name to something more appropriate, or
b) fix the code to match the expectation.

I'm in favour of b). But let's be consistent in the naming and the semantics, across all supported scorers.

menshikh-iv · 2018-07-30T12:53:43Z

@lopusz I'm +1 for #2072 (comment), i.e. should be >=.

The current code looks good, but please resolve the merge conflict.

lopusz · 2018-07-30T15:15:14Z

Great.

Corrected doc string and comment to reflect the >= change + resolved the merge conflict.

lopusz · 2018-07-30T20:29:14Z

@menshikh-iv Do you think it is now mergeable?

menshikh-iv · 2018-07-31T07:24:00Z

@lopusz looks good, thanks! Congratz on your first contribution 🥇

lopusz · 2018-07-31T11:43:04Z

@menshikh-iv :) Yes, my mergeytocin level has never been so high :)

I hope to be contributing more in the future.

menshikh-iv · 2018-07-31T11:47:58Z

@lopusz 🌟 great, you are welcome :)

Fix min_count handling in phrases detection using npmi

6d980d2

lopusz mentioned this pull request Jul 13, 2018

NPMI scorer does not take into account min_count #2086

Closed

piskvorky added the bug label (Issue described a bug) Jul 13, 2018

piskvorky requested changes Jul 13, 2018

Refactor min_count handling in npmi phrases detection

99dce42

piskvorky requested changes Jul 21, 2018

Fix min_count inequality for compatiblity with the rest of the gensim…

61cdbfc

… API

lopusz added 3 commits July 30, 2018 16:53

Fix misleading min_count doc_string

61fa8b2

Fix misleading min_count comment

8123dad

Merge branch 'develop' into develop-phrases

80c5b7e

menshikh-iv merged commit 4d921da into piskvorky:develop Jul 31, 2018

lopusz deleted the develop-phrases branch November 27, 2019 14:30
