Skip to content

Scikit-learn wrapper for FastText model#2178

Merged
menshikh-iv merged 7 commits into
piskvorky:developfrom
mcemilg:ft-wrapper-sklearn
Sep 19, 2018
Merged

Scikit-learn wrapper for FastText model#2178
menshikh-iv merged 7 commits into
piskvorky:developfrom
mcemilg:ft-wrapper-sklearn

Conversation

@mcemilg

@mcemilg mcemilg commented Sep 11, 2018

Copy link
Copy Markdown
Contributor

Fixes #2138

Added wrapper for FastText model to use on scikit-learn pipeline. I use word2vec and doc2vec wrappers as guidance on implementation.

@menshikh-iv menshikh-iv left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work @mcemilg! Please continue.

BTW I see than CI failed by unrelated reason. This will be fixed when we merged #2127 (after you also need to merge fresh develop to your branch)

>>>
>>> # What is the vector representation of the word 'graph'?
>>> wordvecs = model.fit(common_texts).transform(['graph', 'system'])
>>> assert wordvecs.shape == (2, 10)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need more examples here (especially about "how to work with out-of-vocab words", this is the main use case of FastText)

Comment thread gensim/sklearn_api/ftmodel.py Outdated

Parameters
----------

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need empty line

batch_words : int, optional
Target size (in words) for batches of examples passed to worker threads (and
thus cython routines).(Larger batches will be passed if individual
texts are longer than 10000 words, but the standard cython code truncates to that maximum.)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing empty line (at the end of docstring)

Comment thread gensim/test/test_sklearn_api.py Outdated

def testConsistencyWithGensimModel(self):
# training a FTTransformer
self.model = FTTransformer(size=10, min_count=0, seed=42)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for check this, you also need to pin workers=1 (for both models)

Comment thread gensim/test/test_sklearn_api.py Outdated
word = texts[0][0]
vec_transformer_api = self.model.transform(word) # vector returned by FTTransformer
vec_gensim_model = gensim_ftmodel[word] # vector returned by FastText
passed = numpy.allclose(vec_transformer_api, vec_gensim_model, atol=1e-1)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1e-1 looks too large, why this needed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw it on other consistency tests close to word vectors and I felt this is needed. Actually, it is passing without any tolerance parameter.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case please remove it

Comment thread gensim/test/test_sklearn_api.py Outdated
model_dump = pickle.dumps(self.model)
model_load = pickle.loads(model_dump)

word = texts[0][0]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pass all corpus that you have + check with out-of-vocab words

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mcemilg still here, please don't forget to fix

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just fixed it, thanks.

@mcemilg

mcemilg commented Sep 14, 2018

Copy link
Copy Markdown
Contributor Author

Hi for people I boder. There is a wrong git ops over here because of gitkraken. I will undo last changes I did. I am sorry.

@mcemilg

mcemilg commented Sep 16, 2018

Copy link
Copy Markdown
Contributor Author

I reset mistaken git commits and I pushed my last changes. Sorry again for trouble.

Comment thread gensim/test/test_sklearn_api.py Outdated

class TestFastTextWrapper(unittest.TestCase):
def setUp(self):
numpy.random.seed(0)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numpy.random.seed(0) affect all interpreter in general (not only current test), that's bad practice (I think we have similar mistakes in existing tests), can you please remove all calls of numpy.random.seed in your code @mcemilg ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I removed the numpy.random.seed calls.

@menshikh-iv

Copy link
Copy Markdown
Contributor

Thanks @mcemilg, congratz with your first contribution 🥇

@menshikh-iv menshikh-iv merged commit 97783a4 into piskvorky:develop Sep 19, 2018
@mcemilg mcemilg deleted the ft-wrapper-sklearn branch September 20, 2018 18:12
@mcemilg

mcemilg commented Sep 20, 2018

Copy link
Copy Markdown
Contributor Author

Hi @menshikh-iv, thank you for your helps. Do you want me to fix this issue? If you want I can work on it.

@menshikh-iv

menshikh-iv commented Sep 20, 2018

Copy link
Copy Markdown
Contributor

@mcemilg If you have time for it - of course, I will be very grateful 🔥

@mcemilg

mcemilg commented Sep 20, 2018

Copy link
Copy Markdown
Contributor Author

Okay, I will look it as soon as possible. 👍

@menshikh-iv

Copy link
Copy Markdown
Contributor

@mcemilg note: global seeding should never happen (i.e have a look through all library, not test only).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants