Scikit-learn wrapper for FastText model#2178
Conversation
| >>> | ||
| >>> # What is the vector representation of the word 'graph'? | ||
| >>> wordvecs = model.fit(common_texts).transform(['graph', 'system']) | ||
| >>> assert wordvecs.shape == (2, 10) |
There was a problem hiding this comment.
Need more examples here (especially about "how to work with out-of-vocab words", this is the main use case of FastText)
|
|
||
| Parameters | ||
| ---------- | ||
|
|
| batch_words : int, optional | ||
| Target size (in words) for batches of examples passed to worker threads (and | ||
| thus cython routines).(Larger batches will be passed if individual | ||
| texts are longer than 10000 words, but the standard cython code truncates to that maximum.) |
There was a problem hiding this comment.
missing empty line (at the end of docstring)
|
|
||
| def testConsistencyWithGensimModel(self): | ||
| # training a FTTransformer | ||
| self.model = FTTransformer(size=10, min_count=0, seed=42) |
There was a problem hiding this comment.
for check this, you also need to pin workers=1 (for both models)
| word = texts[0][0] | ||
| vec_transformer_api = self.model.transform(word) # vector returned by FTTransformer | ||
| vec_gensim_model = gensim_ftmodel[word] # vector returned by FastText | ||
| passed = numpy.allclose(vec_transformer_api, vec_gensim_model, atol=1e-1) |
There was a problem hiding this comment.
1e-1 looks too large, why this needed?
There was a problem hiding this comment.
I saw it on other consistency tests close to word vectors and I felt this is needed. Actually, it is passing without any tolerance parameter.
There was a problem hiding this comment.
In this case please remove it
| model_dump = pickle.dumps(self.model) | ||
| model_load = pickle.loads(model_dump) | ||
|
|
||
| word = texts[0][0] |
There was a problem hiding this comment.
pass all corpus that you have + check with out-of-vocab words
There was a problem hiding this comment.
@mcemilg still here, please don't forget to fix
There was a problem hiding this comment.
Just fixed it, thanks.
|
Hi for people I boder. There is a wrong git ops over here because of gitkraken. I will undo last changes I did. I am sorry. |
|
I reset mistaken git commits and I pushed my last changes. Sorry again for trouble. |
|
|
||
| class TestFastTextWrapper(unittest.TestCase): | ||
| def setUp(self): | ||
| numpy.random.seed(0) |
There was a problem hiding this comment.
numpy.random.seed(0) affect all interpreter in general (not only current test), that's bad practice (I think we have similar mistakes in existing tests), can you please remove all calls of numpy.random.seed in your code @mcemilg ?
There was a problem hiding this comment.
Okay, I removed the numpy.random.seed calls.
|
Thanks @mcemilg, congratz with your first contribution 🥇 |
|
Hi @menshikh-iv, thank you for your helps. Do you want me to fix this issue? If you want I can work on it. |
|
@mcemilg If you have time for it - of course, I will be very grateful 🔥 |
|
Okay, I will look it as soon as possible. 👍 |
|
@mcemilg note: global seeding should never happen (i.e have a look through all library, not |
Fixes #2138
Added wrapper for FastText model to use on scikit-learn pipeline. I use word2vec and doc2vec wrappers as guidance on implementation.