Improving Scan_Vocab speed, build_vocab_from_freq function. Iteration 2 by jodevak · Pull Request #1695 · piskvorky/gensim

jodevak · 2017-11-06T10:58:24Z

As requested, this is a new pull request. Thanks

horpto · 2017-11-06T11:00:29Z

            testfile(), binary=True, datatype=np.float16
        )
-        self.assertEqual(binary_model_kv.syn0.nbytes, half_precision_model_kv.syn0.nbytes * 2)
+        self.assertEquals(binary_model_kv.syn0.nbytes, half_precision_model_kv.syn0.nbytes * 2)


https://docs.python.org/2/library/unittest.html#deprecated-aliases

assertEquals is deprecated.

horpto · 2017-11-06T11:02:31Z

+            ["minors", "survey", "minors", "survey", "minors"]
+        ]
+        model = word2vec.Word2Vec(sentences, size=10, min_count=0, max_vocab_size=2, seed=42, hs=1, negative=0)
+        self.assertTrue(len(model.wv.vocab), 3)


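Did you mean assertEqual rather than assertTrue? assertTrue treats its second argument (3) as the failure message, so this check passes for any non-empty vocab.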
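To illustrate the concern, here is a small standard-library sketch. The `vocab` dict is a made-up stand-in for `model.wv.vocab`, not gensim's real object, and it deliberately holds 2 entries rather than 3.

```python
import unittest

# Hypothetical stand-in for model.wv.vocab; it holds 2 entries, not 3.
vocab = {"minors": 3, "survey": 2}


class VocabSizeTest(unittest.TestCase):
    def test_assert_true_hides_mismatch(self):
        # The second argument is the failure message, so this passes
        # whenever len(vocab) is non-zero.
        self.assertTrue(len(vocab), 3)

    def test_assert_equal_detects_mismatch(self):
        # assertEqual compares both values and fails here (2 != 3).
        with self.assertRaises(AssertionError):
            self.assertEqual(len(vocab), 3)


def run_tests():
    # Load and run the two tests above, returning the TestResult.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(VocabSizeTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Both tests pass: the first because `assertTrue` never looks at the expected size, the second because `assertEqual` raises on the mismatch.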

horpto · 2017-11-06T11:09:42Z

-        .. [#taddy] Taddy, Matt.  Document Classification by Inversion of Distributed Language Representations, in Proceedings of the 2015 Conference of the Association of Computational Linguistics.
-        .. [#deepir] https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/deepir.ipynb
+        .. [taddy] Taddy, Matt.  Document Classification by Inversion of Distributed Language Representations, in Proceedings of the 2015 Conference of the Association of Computational Linguistics.
+        .. [deepir] https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/deepir.ipynb


Sorry, but why did you remove the # from the citations? (#1633)

The autopep8 tool did that.

This file was merged with an older version.

menshikh-iv · 2017-11-06T13:52:47Z

        Examples
        --------
-        >>> build_vocab_from_freq({"Word1":15,"Word2":20}, update=True)
+        >>> model.build_vocab_from_freq({"Word1":15,"Word2":20}, update=True)


PEP8: model.build_vocab_from_freq({"Word1": 15, "Word2": 20}, update=True)

Sorry, what's the problem with this?

Missing spaces after : and , (the comment above shows the fixed variant).

menshikh-iv · 2017-11-06T13:55:31Z


        Examples
        --------
-        >>> build_vocab_from_freq({"Word1":15,"Word2":20}, update=True)


The model is undefined here; please create the model first. A docstring should be executable, i.e. I should be able to copy-paste this code into a console and have it run successfully. We plan to add doctests to our CI soon.
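As a rough sketch of what "executable" means here, the snippet below runs a docstring example through the standard `doctest` module. `TinyModel` and its method body are hypothetical stand-ins written for this illustration; they are not gensim's `Word2Vec` and say nothing about how gensim implements `build_vocab_from_freq`.

```python
import doctest


class TinyModel:
    """Hypothetical stand-in for a model class; not gensim's Word2Vec."""

    def __init__(self):
        self.raw_vocab = {}

    def build_vocab_from_freq(self, word_freq, update=False):
        """Store a word-frequency table as the raw vocabulary.

        Examples
        --------
        >>> model = TinyModel()
        >>> model.build_vocab_from_freq({"Word1": 15, "Word2": 20}, update=True)
        >>> sorted(model.raw_vocab.items())
        [('Word1', 15), ('Word2', 20)]
        """
        if not update:
            self.raw_vocab = {}
        self.raw_vocab.update(word_freq)


def run_doctests():
    # Parse the docstring into a DocTest and execute its examples.
    # TinyModel is passed in explicitly so the examples can create it.
    parser = doctest.DocTestParser()
    test = parser.get_doctest(
        TinyModel.build_vocab_from_freq.__doc__,
        globs={"TinyModel": TinyModel},
        name="build_vocab_from_freq",
        filename=None,
        lineno=0,
    )
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)  # TestResults(failed=..., attempted=...)
```

Because the example creates `model` itself, it can be pasted into a console as-is; dropping the first `>>>` line would make the doctest fail with a `NameError`.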

menshikh-iv · 2017-11-06T13:55:58Z


-        self.corpus_count = corpus_count if corpus_count else 0
-        self.raw_vocab = vocab
+        self.corpus_count = corpus_count if corpus_count else 0  # Since no sentences are provided, this is to control the corpus_count


PEP8 - two spaces before #

There are two spaces, aren't there?

Oh, you're right, sorry.

menshikh-iv · 2017-11-06T14:03:46Z

+        self.raw_vocab = raw_vocab

-        self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, update=update)  # trim by min_count & precalculate downsampling
+        self.scale_vocab(keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule,update=update)  # trim by min_count & precalculate downsampling


Please restore the previous variant (with the space after the comma).

menshikh-iv · 2017-11-06T14:07:27Z

                    )
                checked_string_types += 1
-            if sentence_no % progress_per == 0:
+            if sentence_no % progress_per == 0 and sentence_no != 0:


Why is this needed?

Because 0 % anything equals 0, so the logger would emit a statement for sentence 0 with 0 words processed.

But we want that :)
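The difference between the two conditions can be sketched with a small helper. `progress_points` and its `skip_zero` flag are hypothetical names invented for this illustration; the sketch only mirrors the modulo check from the diff, not the surrounding scanning code.

```python
def progress_points(num_sentences, progress_per, skip_zero=False):
    """Return the sentence numbers at which a progress message would be logged."""
    points = []
    for sentence_no in range(num_sentences):
        if sentence_no % progress_per == 0:
            if skip_zero and sentence_no == 0:
                continue
            points.append(sentence_no)
    return points


# Original condition: sentence 0 is reported, because 0 % n == 0.
print(progress_points(25, 10))                  # [0, 10, 20]
# Proposed condition: the first report is suppressed.
print(progress_points(25, 10, skip_zero=True))  # [10, 20]
```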

menshikh-iv

Please add a test based on #1599 (comment) and fix the log message based on this comment: #1599 (comment)

After that, I'll merge your PR.

menshikh-iv · 2017-11-06T14:54:29Z

        --------
-        >>> build_vocab_from_freq({"Word1":15,"Word2":20}, update=True)
+        >>> from gensim.models.word2vec import Word2Vec
+        >>> model=Word2Vec()


PEP8: model = Word2Vec()

jodevak · 2017-11-06T15:11:24Z

@menshikh-iv The function testPruneVocab is already there.

menshikh-iv · 2017-11-06T15:23:54Z

@jodevak We need to add a test for total_words, because you changed the "counting logic".

jodevak · 2017-11-06T16:22:31Z

@menshikh-iv Do you have any suggestions for testing total_words, other than adding new attributes to the model object or returning total_words as a value?

menshikh-iv · 2017-11-07T05:52:56Z

I see 3 variants:

1. Test the logger output (a strange way, but why not)
2. Add a _total_words attribute and check it after build_vocab
3. Return total_words from build_vocab

@piskvorky which variant looks best to you?

jodevak · 2017-11-07T09:44:01Z

@menshikh-iv Choice 3 seems most convenient to me.
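A minimal sketch of the third variant, assuming a simplified, hypothetical `scan_vocab` helper that returns the word count alongside the frequency table. It is not the code from this PR and omits pruning, logging and the other work a real scan would do.

```python
from collections import defaultdict


def scan_vocab(sentences):
    """Count word frequencies and return (vocab, total_words).

    Simplified, hypothetical counterpart of a vocabulary scan.
    """
    vocab = defaultdict(int)
    total_words = 0
    for sentence in sentences:
        for word in sentence:
            vocab[word] += 1
        total_words += len(sentence)
    return dict(vocab), total_words


sentences = [
    ["graph", "system"],
    ["graph", "minors", "trees"],
    ["minors", "survey", "minors", "survey", "minors"],
]
vocab, total_words = scan_vocab(sentences)
assert total_words == 10
assert vocab["minors"] == 4
```

Returning the value keeps the test free of log parsing and avoids adding state to the model object, which is the trade-off the three variants are weighing.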

menshikh-iv · 2017-11-08T06:49:01Z

Thank you @jodevak 👍

…#1695)

* fix build vocab speed issue, and new function to build vocab from previously provided word frequencies table
* fix build vocab speed issue, function build vocab from previously provided word frequencies table
* fix build vocab speed issue, function build vocab from previously provided word frequencies table
* fix build vocab speed issue, function build vocab from previously provided word frequencies table
* Removing the extra blank lines, documentation in numpy-style to build_vocab_from_freq, and hanging indents in build_vocab
* Fixing Indentation
* Fixing gensim/models/word2vec.py:697:1: W293 blank line contains whitespace
* Remove trailing white spaces
* Adding test
* fix spaces
* iteration 2 on code
* iteration 2 on code
* Fixing old version of word2vec.py merge problems
* Fixing indent
* Fixing Styling
* Fixing Styling
* test
* test
* adding total words count test
* adding total words count test

jodevak added 13 commits September 25, 2017 17:47

fix build vocab speed issue, and new function to build vocab from previously provided word frequencies table

3f30e1e

fix build vocab speed issue, function build vocab from previously provided word frequencies table

c4f387e

fix build vocab speed issue, function build vocab from previously provided word frequencies table

8abd58b

fix build vocab speed issue, function build vocab from previously provided word frequencies table

8ec0433

Removing the extra blank lines, documentation in numpy-style to build_vocab_from_freq, and hanging indents in build_vocab

b9f3a5f

Fixing Indentation

0a5e8d6

Fixing gensim/models/word2vec.py:697:1: W293 blank line contains whitespace

644fcad

Remove trailing white spaces

c91b4cb

Adding test

1e4ef3e

fix spaces

9ae7a84

iteration 2 on code

1e82811

iteration 2 on code

aa9227d

Merge branch 'build_vocab_freq' of https://github.com/jodevak/gensim into build_vocab_freq

e156b95

horpto reviewed Nov 6, 2017

jodevak added 2 commits November 6, 2017 15:24

Fixing old version of word2vec.py merge problems

2066a2a

Fixing indent

62ed129

menshikh-iv suggested changes Nov 6, 2017

Fixing Styling

473d7e6

menshikh-iv suggested changes Nov 6, 2017

menshikh-iv mentioned this pull request Nov 6, 2017

Fix scan vocab speed issue, build vocab from provided word frequencies #1599

Merged

Fixing Styling

a65e36b

jodevak added 2 commits November 6, 2017 18:24

test

7f46a05

test

f744c4f

adding total words count test

6471164

adding total words count test

9bc6b78

menshikh-iv merged commit 40b0417 into piskvorky:develop Nov 8, 2017
