Scikit-learn wrapper for FastText model by mcemilg · Pull Request #2178 · piskvorky/gensim

mcemilg · 2018-09-11T20:04:02Z

Added a wrapper for the FastText model for use in scikit-learn pipelines. I used the word2vec and doc2vec wrappers as guidance for the implementation.

menshikh-iv

Good work @mcemilg! Please continue.

BTW, I see that CI failed for an unrelated reason. This will be fixed when we merge #2127 (after that, you will also need to merge a fresh develop into your branch).

menshikh-iv · 2018-09-13T01:54:44Z

+>>>
+>>> # What is the vector representation of the word 'graph'?
+>>> wordvecs = model.fit(common_texts).transform(['graph', 'system'])
+>>> assert wordvecs.shape == (2, 10)


We need more examples here, especially on how to work with out-of-vocab words; this is the main use case of FastText.
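As background for the out-of-vocab use case: FastText builds word vectors from character n-grams, so a word never seen during training can still get a vector from the n-grams it shares with known words. The block below is a minimal pure-Python sketch of that idea, not gensim's implementation. The names `char_ngrams`, `bucket_of` and `oov_vector`, the bucket count, the CRC32 hash and the random toy table are all assumptions made for illustration.

```python
import random
import zlib


def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, padded with '<' and '>'."""
    padded = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(padded) - n + 1):
            grams.append(padded[start:start + n])
    return grams


NUM_BUCKETS = 1000  # illustrative value; real models use far more buckets
DIM = 10            # vector size, matching size=10 in the PR's examples

# Toy n-gram table: one random vector per hash bucket. It stands in for
# trained weights and uses a local generator, so no global seeding happens.
_rng = random.Random(42)
NGRAM_TABLE = [[_rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_BUCKETS)]


def bucket_of(gram):
    """Map an n-gram to a bucket with a deterministic hash (CRC32 is an assumption)."""
    return zlib.crc32(gram.encode("utf-8")) % NUM_BUCKETS


def oov_vector(word):
    """Average the bucket vectors of all n-grams of `word`."""
    grams = char_ngrams(word)
    total = [0.0] * DIM
    for gram in grams:
        vec = NGRAM_TABLE[bucket_of(gram)]
        total = [t + v for t, v in zip(total, vec)]
    return [t / len(grams) for t in total]


print(char_ngrams("graph", 3, 3))  # ['<gr', 'gra', 'rap', 'aph', 'ph>']
vec = oov_vector("graphology")     # never "trained", still gets a vector
print(len(vec))                    # 10
```

A docstring example for the wrapper would presumably do the equivalent through `transform`, passing a word that does not occur in `common_texts` and checking the shape of the result.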

menshikh-iv · 2018-09-13T01:55:05Z

+
+        Parameters
+        ----------
+


No need for an empty line here.

menshikh-iv · 2018-09-13T01:56:11Z

+        batch_words : int, optional
+            Target size (in words) for batches of examples passed to worker threads (and
+            thus cython routines). (Larger batches will be passed if individual
+            texts are longer than 10000 words, but the standard cython code truncates to that maximum.)


Missing empty line (at the end of the docstring).

menshikh-iv · 2018-09-13T01:57:35Z

+
+    def testConsistencyWithGensimModel(self):
+        # training a FTTransformer
+        self.model = FTTransformer(size=10, min_count=0, seed=42)


To check this, you also need to pin workers=1 (for both models).

menshikh-iv · 2018-09-13T01:58:12Z

+        word = texts[0][0]
+        vec_transformer_api = self.model.transform(word)  # vector returned by FTTransformer
+        vec_gensim_model = gensim_ftmodel[word]  # vector returned by FastText
+        passed = numpy.allclose(vec_transformer_api, vec_gensim_model, atol=1e-1)


1e-1 looks too large; why is this needed?

I saw it in other consistency tests for word vectors and assumed it was needed. Actually, the test passes without any tolerance parameter.

In that case, please remove it.
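To illustrate why an absolute tolerance of 1e-1 is a concern for word vectors, here is a small sketch with made-up vectors (the values are illustrative and not taken from the PR). With `atol=1e-1`, `numpy.allclose` accepts two vectors whose components all differ by 0.05, while the default tolerances (`rtol=1e-05`, `atol=1e-08`) reject them.

```python
import numpy as np

vec_a = np.array([0.10, -0.20, 0.30])
vec_b = vec_a + 0.05  # every component is off by 0.05

loose = np.allclose(vec_a, vec_b, atol=1e-1)  # absolute tolerance of 0.1
strict = np.allclose(vec_a, vec_b)            # defaults: rtol=1e-05, atol=1e-08

print(loose, strict)  # True False
```

Since word-vector components are often small in magnitude, a tolerance this loose can let a consistency test pass even when the two models disagree.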

menshikh-iv · 2018-09-13T01:59:13Z

+        model_dump = pickle.dumps(self.model)
+        model_load = pickle.loads(model_dump)
+
+        word = texts[0][0]


Pass the whole corpus that you have, and also check with out-of-vocab words.

@mcemilg this is still open; please don't forget to fix it.

Just fixed it, thanks.

mcemilg · 2018-09-14T23:42:19Z

Hi, apologies to anyone I bothered. Some wrong git operations happened here because of GitKraken. I will undo the last changes I made. I am sorry.

mcemilg · 2018-09-16T17:23:13Z

I reset the mistaken git commits and pushed my latest changes. Sorry again for the trouble.

menshikh-iv · 2018-09-18T01:09:51Z


+class TestFastTextWrapper(unittest.TestCase):
+    def setUp(self):
+        numpy.random.seed(0)


numpy.random.seed(0) affects the whole interpreter (not only the current test), which is bad practice (I think we have similar mistakes in existing tests). Can you please remove all calls to numpy.random.seed in your code, @mcemilg?

Okay, I removed the numpy.random.seed calls.
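For readers wondering what to use instead of global seeding, one common alternative is a local generator object. The sketch below is an assumption about how this could look, not code from this PR. `make_noise` is a hypothetical helper; it draws reproducible numbers from its own `numpy.random.RandomState` and leaves the interpreter-wide generator untouched.

```python
import numpy as np


def make_noise(seed, size=3):
    """Draw reproducible random numbers from a local generator."""
    rng = np.random.RandomState(seed)  # local state, independent of np.random.*
    return rng.rand(size)


# Snapshot the global generator state before and after using the helper.
state_before = np.random.get_state()
first = make_noise(0)
second = make_noise(0)
state_after = np.random.get_state()

# Same seed gives the same numbers, and the global state was never modified.
print(np.array_equal(first, second))                    # True
print(np.array_equal(state_before[1], state_after[1]))  # True
```

In the tests discussed here, reproducibility is already handled by the model's own `seed` parameter (`seed=42` in the quoted snippet), so no global call is needed.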

menshikh-iv · 2018-09-19T02:55:20Z

Thanks @mcemilg, congrats on your first contribution 🥇

mcemilg · 2018-09-20T18:37:57Z

Hi @menshikh-iv, thank you for your help. Do you want me to fix this issue? If you want, I can work on it.

menshikh-iv · 2018-09-20T18:40:56Z

@mcemilg If you have time for it, of course, I will be very grateful 🔥

mcemilg · 2018-09-20T18:46:13Z

Okay, I will look into it as soon as possible. 👍

menshikh-iv · 2018-09-20T18:48:48Z

@mcemilg note: global seeding should never happen (i.e. have a look through the whole library, not only the tests).

mcemilg added 2 commits September 11, 2018 22:48

Add scikit-learn wrapper for fasttext model.

7832107

Add sklearn fasttext wrapper test.

dc4e721

menshikh-iv suggested changes Sep 13, 2018

Fix docstring.

735e690

mcemilg added 2 commits September 16, 2018 20:14

Add more examples.

b1df5ee

Add tests for oov words. Fix some tests.

d7a6a5b

Pass all corpus on persistence test.

a0d5993

menshikh-iv reviewed Sep 18, 2018

Remove numpy.random.seed calls.

7c23783

menshikh-iv merged commit 97783a4 into piskvorky:develop Sep 19, 2018

mcemilg deleted the ft-wrapper-sklearn branch September 20, 2018 18:12
