Make `word2vec2tensor` script compatible with `python3` by vsocrates · Pull Request #2147 · piskvorky/gensim

vsocrates · 2018-08-06T06:30:25Z

Fixes #1958. The word2vec2tensor script didn't support Python 3 due to unicode encoding. This was fixed and tests were added to ensure that the script functions correctly in the future. The usage of the script on the user side remains the same.
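For context, here is a minimal sketch of the kind of Python 3 failure that writing text to a binary-mode file produces, and the usual fix of encoding to UTF-8 first. It is an illustration under assumptions, not the script's actual code: `write_row` is a hypothetical helper, and `io.BytesIO` stands in for a file opened with `'wb'`.

```python
import io


def write_row(stream, word):
    """Write one metadata row to a binary stream (hypothetical helper)."""
    # Python 3 separates text (str) from bytes, so encode explicitly.
    stream.write(word.encode('utf8') + b'\n')


buffer = io.BytesIO()  # stands in for open(path, 'wb')

try:
    buffer.write('the\n')  # passing str to a binary stream
except TypeError as err:
    print('Python 3 rejects str here:', err)

write_row(buffer, 'the')
write_row(buffer, 'caf\u00e9')  # non-ASCII word, encoded as two UTF-8 bytes
print(buffer.getvalue())
```

Under Python 2 the first write would have succeeded because `str` was already a byte string, which is likely why such code worked there.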

menshikh-iv

Thanks for the PR @vsocrates, please continue 👍 you are on the right track!

menshikh-iv · 2018-08-07T14:35:22Z


    """
-    model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_model_path, binary=binary)
+    model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_model_path, binary=binary, datatype=np.float64)


Why np.float64 instead of np.float32 (the default value of this parameter)?

For some reason, when the word2vec model is loaded from the test data file word2vec_pre_kv_c with the default datatype, I run into parsing issues. For instance, for the word "the", I get array([-0.56110603, -1.97569799, 1.66395497, -1.23224604, 0.75475103, 0.98576403, 2.26144099, -0.59829003, -0.47433099, -1.41610503], dtype=float32) instead of -0.561106 -1.975698 1.663955 -1.232246 0.754751 0.985764 2.261441 -0.598290 -0.474331 -1.416105

And what's the problem here? I still don't follow, sorry.

Sorry, I should have been clearer. Taking the first dimension as an example, the original word2vec model has -0.561106, but without np.float64 the value reads in as -0.56110603, so the equality test fails. I'm not sure of the reason, though I assume it has something to do with the load_word2vec_format internals.

Which assertion do you mean? Can you link the specific line of code?

Also, you can fix the assertion (slightly relax the "almost equal" condition), but either way, you shouldn't change the dtype here.
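The digits quoted in this thread look consistent with ordinary float32 rounding: a 32-bit float carries roughly 7 significant decimal digits, so the value nearest to -0.561106 shows extra trailing digits once it is widened or printed at higher precision. Below is a minimal standard-library sketch of that effect. `to_float32` is a hypothetical helper that round-trips a Python float through 4 bytes, and `math.isclose` is only a stand-in for whatever tolerance-based assertion the test uses.

```python
import math
import struct


def to_float32(value):
    """Round a 64-bit Python float to the nearest 32-bit float and widen it back."""
    return struct.unpack('<f', struct.pack('<f', value))[0]


original = -0.561106           # value as written in the text file
stored = to_float32(original)  # what a float32 array would hold

print(repr(stored))            # shows digits beyond the six written in the file
print(stored == original)      # False: exact equality does not hold

# Comparing with an absolute tolerance instead of exact equality.
print(math.isclose(stored, original, rel_tol=0.0, abs_tol=1e-6))  # True
```

The difference is bounded by half a float32 unit in the last place, so a slightly relaxed tolerance absorbs it without changing the dtype used when loading.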

menshikh-iv · 2018-08-07T14:37:41Z

+class TestWord2Vec2Tensor(unittest.TestCase):
+    def setUp(self):
+        self.datapath = datapath('word2vec_pre_kv_c')
+        self.output_folder = get_tmpfile('')


Better to pass an explicit name (for example w2v2t_test).
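One reason an explicit name may help: joining a directory with an empty string gives a path that ends in a separator, so files derived from it have no recognizable prefix. Here is a small standard-library sketch. `tmp_path_for` is a hypothetical stand-in, not gensim's `get_tmpfile` implementation, and the `_tensor.tsv` suffix is assumed for illustration.

```python
import os
import shutil
import tempfile


def tmp_path_for(name):
    """Return a path inside a fresh temporary directory (hypothetical helper)."""
    return os.path.join(tempfile.mkdtemp(), name)


unnamed = tmp_path_for('')
named = tmp_path_for('w2v2t_test')

print(unnamed.endswith(os.sep))                   # True: just a directory path
print(os.path.basename(named + '_tensor.tsv'))    # w2v2t_test_tensor.tsv
print(os.path.basename(unnamed + '_tensor.tsv'))  # _tensor.tsv

# Clean up the two temporary directories created above.
for path in (unnamed, named):
    shutil.rmtree(os.path.dirname(path))
```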

menshikh-iv · 2018-08-07T14:38:57Z

+    def testConversion(self):
+        word2vec2tensor(word2vec_model_path=self.datapath, tensor_filename=self.output_folder)
+
+        try:


No need for try/except in this test. If an exception is raised, the test fails, which is the expected behavior.

Without try/except, shouldn't we give developers more feedback about which part of the test went wrong? If so, how would I do that without the except block?

Not in this case. If something goes wrong, the stack trace has enough information for the current tests.
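To illustrate the point, here is a minimal sketch showing that `unittest` records an uncaught exception inside a test method as an error, together with its traceback, so no `try`/`except` is needed to see what failed. The test case and the missing file name are hypothetical.

```python
import io
import os
import tempfile
import unittest


def run_failing_test():
    """Run a throwaway test that opens a missing file and return the result."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        missing_path = os.path.join(tmp_dir, 'missing_model.txt')

        class ConversionTest(unittest.TestCase):
            def test_conversion(self):
                # No try/except: the exception propagates and unittest records it.
                with open(missing_path, 'rb') as fin:
                    fin.readline()

        suite = unittest.defaultTestLoader.loadTestsFromTestCase(ConversionTest)
        runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)
        return runner.run(suite)


result = run_failing_test()
print(len(result.errors))   # 1: the test is reported as an error
print(result.errors[0][1])  # traceback text, including the failing line
```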

menshikh-iv · 2018-08-07T14:40:15Z

+            first_line = f.readline().strip()
+
+        number_words, vector_size = map(int, first_line.split(b' '))
+        if not len(metadata) == len(vectors) == number_words:


assert <CONDITION>, <MESSAGE>
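As a sketch of the suggested form, the row-count check can be written as a single `assert` with its message, which also fits within a 120-character line. `check_row_counts` and its arguments are hypothetical names used only for illustration.

```python
def check_row_counts(metadata, vectors, number_words, metadata_file, tensor_file):
    """Fail with a descriptive message if the three row counts disagree."""
    assert len(metadata) == len(vectors) == number_words, (
        'Metadata file %s and tensor file %s imply different number of rows.' % (metadata_file, tensor_file)
    )


# Matching counts pass silently.
check_row_counts(['the', 'of'], [[0.1], [0.2]], 2, 'm.tsv', 't.tsv')
```

If the counts differ, the `AssertionError` carries the message, and the test runner reports it with the traceback.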

menshikh-iv · 2018-08-07T14:41:09Z

+        number_words, vector_size = map(int, first_line.split(b' '))
+        if not len(metadata) == len(vectors) == number_words:
+            self.fail(
+                'Metadata file %s and tensor file %s \


We use a 120-character line limit, so there's no reason to make lines this short.

menshikh-iv · 2018-08-07T14:44:58Z

+                imply different number of rows.' % (self.metadata_file, self.tensor_file)
+                )
+
+        # write word2vec to file


This part of the test is strange: you copy part of the code from the script in order to test that same script. Why?

I'm sorry, could you be a bit more specific? Do you mean lines 154-157? If so, the idea was to take the metadata and tensor components that were separated, put them back together in w2v style without any gensim functions, and make sure they match what the word2vec2tensor.py script creates. I guess it basically just tests to_utf8, though. Is it fine to remove it?

I think it's fine to remove it, yes

vsocrates · 2018-08-11T21:25:56Z

@menshikh-iv The changes you mentioned have been made.

With regard to the float32 issue, I thought it'd be easier to show you what I was getting. The CI tests below show what happens when I don't include the float64 data type. The np.testing module tests to 7 decimal places, and what is in the original file vs. what is written to the two tensor files agrees to fewer than 7 decimal places in the following assertion

menshikh-iv · 2018-08-13T04:12:41Z

@vsocrates OK, I see. Simply "relax" the decimal parameter of the assert_almost_equal function (to 6 or 5); that's not a big deal (and I can merge the current PR after CI passes).

menshikh-iv · 2018-08-14T01:53:34Z

Thank you @vsocrates, congrats on your first contribution 🥇!

vsocrates added 5 commits August 6, 2018 02:00

encoded strings to unicode

06ad270

added test scripts for word2vec2tensor

940ee83

added acknowledgement

ad6b52f

removed windows CRs

4c849dd

changed filenotfound to exception to appease flake8

2fb3ec2

menshikh-iv suggested changes Aug 7, 2018


menshikh-iv changed the title ~~W2v2tensor~~ Make word2vec2tensor script compatible with python3 Aug 10, 2018

vsocrates added 2 commits August 11, 2018 16:31

addressed comments, added key-wise assert

186d6e8

forgot to check flake8 again

9fc2ad6

added dec param to pass test

d94e41d

menshikh-iv merged commit 27c524d into piskvorky:develop Aug 14, 2018

vsocrates deleted the w2v2tensor branch August 14, 2018 02:43

2 participants