Improving Scan_Vocab speed, build_vocab_from_freq function. Iteration 2#1695
Conversation
…viously provided word frequencies table
…vided word frequencies table
…vided word frequencies table
…vided word frequencies table
…_vocab_from_freq, and hanging indents in build_vocab
…into build_vocab_freq
| testfile(), binary=True, datatype=np.float16 | ||
| ) | ||
| self.assertEqual(binary_model_kv.syn0.nbytes, half_precision_model_kv.syn0.nbytes * 2) | ||
| self.assertEquals(binary_model_kv.syn0.nbytes, half_precision_model_kv.syn0.nbytes * 2) |
There was a problem hiding this comment.
https://docs.python.org/2/library/unittest.html#deprecated-aliases
assertEquals is deprecated.
| ["minors", "survey", "minors", "survey", "minors"] | ||
| ] | ||
| model = word2vec.Word2Vec(sentences, size=10, min_count=0, max_vocab_size=2, seed=42, hs=1, negative=0) | ||
| self.assertTrue(len(model.wv.vocab), 3) |
There was a problem hiding this comment.
maybe you need assertEqual, not assertTrue, don't you?
| .. [#taddy] Taddy, Matt. Document Classification by Inversion of Distributed Language Representations, in Proceedings of the 2015 Conference of the Association of Computational Linguistics. | ||
| .. [#deepir] https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/deepir.ipynb | ||
| .. [taddy] Taddy, Matt. Document Classification by Inversion of Distributed Language Representations, in Proceedings of the 2015 Conference of the Association of Computational Linguistics. | ||
| .. [deepir] https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/deepir.ipynb |
There was a problem hiding this comment.
I'm sorry, but why are you remove # in citate ? (#1633)
There was a problem hiding this comment.
this file is merged with an older version.
| Examples | ||
| -------- | ||
| >>> build_vocab_from_freq({"Word1":15,"Word2":20}, update=True) | ||
| >>> model.build_vocab_from_freq({"Word1":15,"Word2":20}, update=True) |
There was a problem hiding this comment.
PEP8: model.build_vocab_from_freq({"Word1": 15, "Word2": 20}, update=True)
There was a problem hiding this comment.
sorry, whats the problem with this ?
There was a problem hiding this comment.
spaces after :, , (in comment fixed variant)
|
|
||
| Examples | ||
| -------- | ||
| >>> build_vocab_from_freq({"Word1":15,"Word2":20}, update=True) |
There was a problem hiding this comment.
Model is undefined, please create model first (docstring should be executable, i.e. I can copy-paste this code to console and I expect that code run successfully) we plan to add doctests to our CI soon.
|
|
||
| self.corpus_count = corpus_count if corpus_count else 0 | ||
| self.raw_vocab = vocab | ||
| self.corpus_count = corpus_count if corpus_count else 0 # Since no sentences are provided, this is to control the corpus_count |
There was a problem hiding this comment.
PEP8 - two spaces before #
There was a problem hiding this comment.
These are 2 space, arent they ?
| self.raw_vocab = raw_vocab | ||
|
|
||
| self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, update=update) # trim by min_count & precalculate downsampling | ||
| self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule,update=update) # trim by min_count & precalculate downsampling |
There was a problem hiding this comment.
Return previous variant
| ) | ||
| checked_string_types += 1 | ||
| if sentence_no % progress_per == 0: | ||
| if sentence_no % progress_per == 0 and sentence_no != 0: |
There was a problem hiding this comment.
Because 0% anything will equal to 0; so the logger will log a statement saying sentence 0 and processed 0.
There was a problem hiding this comment.
Please add test based on #1599 (comment) and fix log message based on this comment - #1599 (comment)
After this - I'll merge your PR
| -------- | ||
| >>> build_vocab_from_freq({"Word1":15,"Word2":20}, update=True) | ||
| >>> from gensim.models.word2vec import Word2Vec | ||
| >>> model=Word2Vec() |
There was a problem hiding this comment.
PEP8 model = Word2Vec()
|
@menshikh-iv function testPruneVocab is already there . |
|
@jodevak need add test for |
|
@menshikh-iv Do you have any suggestions to test total_words, other than adding new attributes to the model object nor returning total_words as a value ? |
|
I see 3 variants
@piskvorky what's variant looks best for you? |
|
@menshikh-iv Choice 3 seems most convenient to me. |
|
Thank you @jodevak 👍 |
…#1695) * fix build vocab speed issue, and new function to build vocab from previously provided word frequencies table * fix build vocab speed issue, function build vocab from previously provided word frequencies table * fix build vocab speed issue, function build vocab from previously provided word frequencies table * fix build vocab speed issue, function build vocab from previously provided word frequencies table * Removing the extra blank lines, documentation in numpy-style to build_vocab_from_freq, and hanging indents in build_vocab * Fixing Indentation * Fixing gensim/models/word2vec.py:697:1: W293 blank line contains whitespace * Remove trailing white spaces * Adding test * fix spaces * iteration 2 on code * iteration 2 on code * Fixing old version of word2vec.py merge problems * Fixing indent * Fixing Styling * Fixing Styling * test * test * adding total words count test * adding total words count test
As requested, this is a new pull request. Thanks