Skip to content

Improving Scan_Vocab speed, build_vocab_from_freq function. Iteration 2#1695

Merged
menshikh-iv merged 21 commits into
piskvorky:developfrom
jodevak:build_vocab_freq
Nov 8, 2017
Merged

Improving Scan_Vocab speed, build_vocab_from_freq function. Iteration 2#1695
menshikh-iv merged 21 commits into
piskvorky:developfrom
jodevak:build_vocab_freq

Conversation

@jodevak
Copy link
Copy Markdown
Contributor

@jodevak jodevak commented Nov 6, 2017

As requested, this is a new pull request. Thanks

Comment thread gensim/test/test_word2vec.py Outdated
testfile(), binary=True, datatype=np.float16
)
self.assertEqual(binary_model_kv.syn0.nbytes, half_precision_model_kv.syn0.nbytes * 2)
self.assertEquals(binary_model_kv.syn0.nbytes, half_precision_model_kv.syn0.nbytes * 2)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alright

Comment thread gensim/test/test_word2vec.py Outdated
["minors", "survey", "minors", "survey", "minors"]
]
model = word2vec.Word2Vec(sentences, size=10, min_count=0, max_vocab_size=2, seed=42, hs=1, negative=0)
self.assertTrue(len(model.wv.vocab), 3)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you need assertEqual, not assertTrue, don't you?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will fix

Comment thread gensim/models/word2vec.py Outdated
.. [#taddy] Taddy, Matt. Document Classification by Inversion of Distributed Language Representations, in Proceedings of the 2015 Conference of the Association of Computational Linguistics.
.. [#deepir] https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/deepir.ipynb
.. [taddy] Taddy, Matt. Document Classification by Inversion of Distributed Language Representations, in Proceedings of the 2015 Conference of the Association of Computational Linguistics.
.. [deepir] https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/deepir.ipynb
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, but why are you remove # in citate ? (#1633)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

autopep8 tool did

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will fix

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this file is merged with an older version.

Comment thread gensim/models/word2vec.py Outdated
Examples
--------
>>> build_vocab_from_freq({"Word1":15,"Word2":20}, update=True)
>>> model.build_vocab_from_freq({"Word1":15,"Word2":20}, update=True)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8: model.build_vocab_from_freq({"Word1": 15, "Word2": 20}, update=True)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, whats the problem with this ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spaces after :, , (in comment fixed variant)

Comment thread gensim/models/word2vec.py

Examples
--------
>>> build_vocab_from_freq({"Word1":15,"Word2":20}, update=True)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Model is undefined, please create model first (docstring should be executable, i.e. I can copy-paste this code to console and I expect that code run successfully) we plan to add doctests to our CI soon.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment thread gensim/models/word2vec.py

self.corpus_count = corpus_count if corpus_count else 0
self.raw_vocab = vocab
self.corpus_count = corpus_count if corpus_count else 0 # Since no sentences are provided, this is to control the corpus_count
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8 - two spaces before #

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are 2 space, arent they ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, really, sorry

Comment thread gensim/models/word2vec.py Outdated
self.raw_vocab = raw_vocab

self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, update=update) # trim by min_count & precalculate downsampling
self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule,update=update) # trim by min_count & precalculate downsampling
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return previous variant

Comment thread gensim/models/word2vec.py Outdated
)
checked_string_types += 1
if sentence_no % progress_per == 0:
if sentence_no % progress_per == 0 and sentence_no != 0:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did this need?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because 0% anything will equal to 0; so the logger will log a statement saying sentence 0 and processed 0.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we want that :)

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we want that :)

Copy link
Copy Markdown
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add test based on #1599 (comment) and fix log message based on this comment - #1599 (comment)

After this - I'll merge your PR

Comment thread gensim/models/word2vec.py Outdated
--------
>>> build_vocab_from_freq({"Word1":15,"Word2":20}, update=True)
>>> from gensim.models.word2vec import Word2Vec
>>> model=Word2Vec()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8 model = Word2Vec()

@jodevak
Copy link
Copy Markdown
Contributor Author

jodevak commented Nov 6, 2017

@menshikh-iv function testPruneVocab is already there .

@menshikh-iv
Copy link
Copy Markdown
Contributor

@jodevak need add test for total_words, because you change "counting logic"

@jodevak
Copy link
Copy Markdown
Contributor Author

jodevak commented Nov 6, 2017

@menshikh-iv Do you have any suggestions to test total_words, other than adding new attributes to the model object nor returning total_words as a value ?

@menshikh-iv
Copy link
Copy Markdown
Contributor

I see 3 variants

  1. Test for logger output (strange way, but why not)
  2. Make _total_words attr & check it after build_vocab
  3. Return total_words from build_vocab

@piskvorky what's variant looks best for you?

@jodevak
Copy link
Copy Markdown
Contributor Author

jodevak commented Nov 7, 2017

@menshikh-iv Choice 3 seems most convenient to me.

@menshikh-iv
Copy link
Copy Markdown
Contributor

Thank you @jodevak 👍

@menshikh-iv menshikh-iv merged commit 40b0417 into piskvorky:develop Nov 8, 2017
VaiyeBe pushed a commit to VaiyeBe/gensim that referenced this pull request Nov 26, 2017
…#1695)

* fix build vocab speed issue, and new function to build vocab from previously provided word frequencies table

* fix build vocab speed issue, function build vocab from previously provided word frequencies table

* fix build vocab speed issue, function build vocab from previously provided word frequencies table

* fix build vocab speed issue, function build vocab from previously provided word frequencies table

* Removing the extra blank lines, documentation in numpy-style to build_vocab_from_freq, and hanging indents in build_vocab

* Fixing Indentation

* Fixing gensim/models/word2vec.py:697:1: W293 blank line contains whitespace

* Remove trailing white spaces

* Adding test

* fix spaces

* iteration 2 on code

* iteration 2 on code

* Fixing old version of word2vec.py merge problems

* Fixing indent

* Fixing Styling

* Fixing Styling

* test

* test

* adding total words count test

* adding total words count test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants