Generate Deprecated exception when using Word2Vec.load_word2vec_format #1165
gojomo
left a comment
Restored branch to be able to leave line-specific comments.
| NOTE: document vectors are not loaded/saved with .load/save_word2vec_format(). Use .save()/.load() instead. | ||
| If you're finished training a model (=no more updates, only querying), you can do | ||
| >>> model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True): |
I believe this will also break inference, so comment should mention that too.
Isn't that what the keep_inference=True is for?
Inference is preserved. It is tested in https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_doc2vec.py#L319
I see, though this is still kind of odd. As called in this prominent example, this method hardly gets rid of anything – just the relatively-tiny doctag_syn0_lockf. Someone who just needs that tiny benefit could be coached to execute del model.docvecs.doctag_syn0_lockf. (I fear here, and to some extent on Word2Vec too, this method is attractive to novices but likely to cause headaches for them and then support/maintenance issues down the road.)
| .. [3] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. | ||
| In Proceedings of NIPS, 2013. | ||
| .. [blog] Optimizing word2vec in gensim, http://radimrehurek.com/2013/09/word2vec-in-python-part-two-optimizing/ | ||
| .. [tutorial] Doc2vec in gensim tutorial, http://radimrehurek.com/2013/09/word2vec-in-python-part-two-optimizing/ |
| The word vectors can also be instantiated from an existing file on disk in the word2vec C format as a KeyedVectors instance:: | ||
| NOTE: It is impossible to continue training the vectors loaded from the C format because the binary tree is missing. |
Not just the binary tree (which is only used in hs mode), but the hidden-weights and vocabulary-frequency information are missing.
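To illustrate why: the word2vec C text format holds only a header line (vocabulary size and vector dimensionality) followed by one word and its vector per line. A minimal pure-Python sketch of a reader shows that nothing else is there to recover. The function name `read_word2vec_text` and the sample data are hypothetical, and this is not gensim's loader:

```python
import io


def read_word2vec_text(stream):
    """Parse the word2vec C text format: a 'vocab_size vector_size' header,
    then one 'word v1 v2 ... vN' line per word. (Hypothetical helper.)"""
    vocab_size, vector_size = (int(x) for x in stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        parts = stream.readline().rstrip("\n").split(" ")
        word = parts[0]
        values = [float(v) for v in parts[1:] if v]
        if len(values) != vector_size:
            raise ValueError("bad vector length for %r" % word)
        vectors[word] = values
    return vectors


# Made-up sample file contents, for illustration only.
sample = io.StringIO("2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n")
vectors = read_word2vec_text(sample)
```

The result is just a word-to-vector mapping: no hidden weights, no binary tree, and no per-word frequencies, so there is nothing to resume training from.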
| If you're finished training a model (=no more updates, only querying), you can do | ||
| >>> model.init_sims(replace=True) | ||
| >>> model.delete_temporary_training_data(replace_word_vectors_with_normalized=True) |
With KeyedVectors now the recommended form for read-only access, perhaps the proper recommendation for "if you're sure you're done training" is to discard the Word2Vec model instance entirely, and just retain the KeyedVectors.
Of course. This is some weird mix, "the worst of both worlds", complicating the API and confusing people.
| where "words" are actually multiword expressions, such as `new_york_times` or `financial_crisis`: | ||
| >>> bigram_transformer = gensim.models.Phrases(sentences) | ||
| >>> bigram_transformer = gensim.models.Phraser(gensim.models.Phrases(sentences)) |
Personally, I would not recommend that all users prefer Phraser without understanding the extra steps it requires: the reduction pass takes extra time, and it throws out some info (in Phrases) that was expensive to collect and that allows experimentation with different count/threshold values.
| super(Word2Vec, self)._load_specials(*args, **kwargs) | ||
| @classmethod | ||
| def load_word2vec_format(cls, fname, fvocab=None, binary=False, encoding='utf8', unicode_errors='strict', |
This now makes the call_on_class_only reference in __init__() superfluous/wrong.
Users have been thrown off by the Word2Vec.load_word2vec_format method disappearing without an obvious alternative. An exception is now thrown directing them to KeyedVectors.
Docstrings and ipynbs are also updated with the KeyedVectors changes.
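The pattern described here can be sketched roughly as follows. This is a minimal illustration with simplified stand-in classes, not the code in this PR: the class bodies, the exact message text, and the file name `vectors.bin` are assumptions.

```python
class KeyedVectors(object):
    """Stand-in for the read-only vector container (hypothetical, simplified)."""

    @classmethod
    def load_word2vec_format(cls, fname, binary=False):
        # A real loader would parse the file; this stub only records its arguments.
        instance = cls()
        instance.source = (fname, binary)
        return instance


class Word2Vec(object):
    """Stand-in for the trainable model (hypothetical, simplified)."""

    @classmethod
    def load_word2vec_format(cls, *args, **kwargs):
        # Keep the old name so callers get a pointer to the replacement
        # instead of an AttributeError with no hint.
        raise DeprecationWarning(
            "Deprecated. Use KeyedVectors.load_word2vec_format instead."
        )


# The old entry point now fails loudly and names the alternative.
try:
    Word2Vec.load_word2vec_format("vectors.bin", binary=True)
except DeprecationWarning as err:
    message = str(err)

# The replacement entry point is called on KeyedVectors.
kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)
```

Raising rather than merely warning means existing callers cannot silently continue with a method that no longer does anything useful.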