Fix critical issues in FastText#2313
Merged
Merged
Conversation
$ ~/src/fastText-0.1.0/fasttext cbow -input toy-data.txt -output toy-model -bucket 100 Read 0M words Number of words: 22 Number of labels: 0 Progress: 100.0% words/sec/thread: 209 lr: 0.000000 loss: 4.100698 eta: 0h0m -14m
this will make it easier to debug manually $ ~/src/fastText-0.1.0/fasttext cbow -input toy-data.txt -output toy-model -bucket 100 -dim 5 Read 0M words Number of words: 22 Number of labels: 0 Progress: 100.0% words/sec/thread: 199 lr: 0.000000 loss: 0.000000 eta: 0h0m
it cannot pass by design: training is non-deterministic, so conditions must be tightly controlled to guarantee reproducibility, and that is too much effort for a unit test
FastText
Contributor
|
Great work @mpenkov 🚀 what's missing before the merge
|
This is an internal method masquerading as a public one. There is no reason for anyone to call it. Removing it will have no effect on pickling/unpickling, as methods do not get serialized. Therefore, removing it is safe.
menshikh-iv
approved these changes
Jan 11, 2019
Owner
I'd prefer to have the same implementation as FastText. Reasons:
|
Contributor
|
Great, we conclude all stuff. Nice job @mpenkov 🔥 |
FastTextFastText
This was referenced Jan 11, 2019
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Current PR contains fixes for all critical bugs in our fasttext implementation:
train()multiple times in a row without any issues.In conclusion - this makes FastText in Gensim more reliable, and directly compatible with FB's FT implementation for OOV words and model persistence.
We also identified divergent behavior with the Facebook implementation. This behavior is caused by an optimization that uses a smaller number of buckets than available. The manifestation is that if we compare two models:
then 1) will have fewer vectors than 2). As a consequence, vectors for OOV terms between the models will differ. This behavior is captured in our unit tests as test_out_of_vocab_gensim.