Skip to content

Fix #1196 - Gensim error when loading FastText#1214

Merged
tmylk merged 9 commits into
piskvorky:developfrom
jaksmid:develop
Mar 30, 2017
Merged

Fix #1196 - Gensim error when loading FastText#1214
tmylk merged 9 commits into
piskvorky:developfrom
jaksmid:develop

Conversation

@jaksmid
Copy link
Copy Markdown
Contributor

@jaksmid jaksmid commented Mar 14, 2017

Gensim can load large fasttext model on Mac
Loading of binary fasttext models is faster
Fasttext vector size is correctly set
Vector_size is also saved when loading just the vector_file, which can be useful if you are not interested in training

@tmylk
Copy link
Copy Markdown
Contributor

tmylk commented Mar 17, 2017

Travis tests re-ran after smart_open update

@tmylk
Copy link
Copy Markdown
Contributor

tmylk commented Mar 20, 2017

What is the purpose of adding a new attribute vector_size to all KeyedVectors? isn't it always syn0.shape[0]? If necessary, maybe just add it to FastText class?

@jaksmid
Copy link
Copy Markdown
Contributor Author

jaksmid commented Mar 21, 2017

Thanks @tmylk for the comment. To my understanding, you would use FastText in the case you want to load both the vec and bin files of the fasttext. In case you just want to load the vectors you can use FastTextKeyedVectors. As you pointed out, you can use syn0.shape[0] which may not be that straightforward if you are not familiar with the gensim internals. So I thought it can be a good idea to add vector_size explicitly so people can figure out the right name of the attribute more easily.

By adding it to the FastText class you meant FastTextKeyedVectors class, right?
Thanks!

EDIT: It seems like it should be self.syn0.shape[1] as seen in keyedvectors - save_word2vec_format which kind of proves my points that it is not that staightforward to get it right :)

Aslo, there is currenty no override of load_word2vec_format in the FastTextKeydVector, so unless we increase the complexity of FastTextKeyedVector by adding this override we have no place to initialize vector_size is FastTextKeyedVector.

Looking forward to your suggestions.

@jaksmid
Copy link
Copy Markdown
Contributor Author

jaksmid commented Mar 22, 2017

Furthermore, you cannot just use FastTextKeyedVectors without FastText initialization (which needs both vec and bin) as __contains__ in FastTextKeyedVectors references self.min_n which is actually set in FastText load_model_params.

Copy link
Copy Markdown
Contributor

@tmylk tmylk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just indent changes requested. Glad it became a comprehensive fix!
Please add a note in the changelog.md as well.

Comment thread gensim/test/test_fasttext_wrapper.py Outdated
# In vocab, sanity check
self.assertEqual(len(self.test_model.most_similar_cosmul(positive=['the', 'and'], topn=5)), 5)
self.assertEqual(self.test_model.most_similar_cosmul('the'), self.test_model.most_similar_cosmul(positive=['the']))
self.assertEqual(self.test_model.most_similar_cosmul('the'),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use hanging indent

Comment thread gensim/test/test_fasttext_wrapper.py Outdated
# Out of vocab check
self.assertEqual(len(self.test_model.most_similar_cosmul(['night', 'nights'], topn=5)), 5)
self.assertEqual(self.test_model.most_similar_cosmul('nights'), self.test_model.most_similar_cosmul(positive=['nights']))
self.assertEqual(self.test_model.most_similar_cosmul('nights'),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hanging indent preferred

@jaksmid
Copy link
Copy Markdown
Contributor Author

jaksmid commented Mar 23, 2017

Thanks for reviewing the code.
I incorporated the code review and merged with the current develop.

It seems like the build for python 2 stalled, could you rerun it @tmylk please?

@gojomo gojomo changed the title Issue 1196 Fix #1196 - Gensim error when loading FastText Mar 23, 2017
@tmylk
Copy link
Copy Markdown
Contributor

tmylk commented Mar 24, 2017

These tests are know to occasionally fail but it's the first time they fail constantly. Will disable them in the main branch soon.

@jaksmid
Copy link
Copy Markdown
Contributor Author

jaksmid commented Mar 24, 2017

Ok. Let me know if I can help with anything.

@tmylk tmylk merged commit ed49ea6 into piskvorky:develop Mar 30, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants