Wrapper for Varembed Models#1067
Conversation
|
I've added wrapper to load varembed model into gensim and allowing keyedvectors functionalities. |
5b78e5d to
6034e68
Compare
|
There seems to be a small issue specifically in the morfessor package. I've submitted a PR aalto-speech/morfessor#6 for the same. We'll probably have to wait to get it merged or find some other fix to get the code running in Python 2.6. Any suggestions? |
|
@tmylk Apart from the issue in Python 2.6, which I've already discussed. The code right now fails in other versions as well, which seems to me a problem in circular dependency in imports in morfessor. Though the tests run fine on my personal machine, it fails on Travis, which is something I don't understand. Any ideas, what's going wrong? |
tmylk
left a comment
There was a problem hiding this comment.
More tests, KeyedVectors and other small improvements needed.
| """ | ||
| Load the input-hidden weight matrix from the fast text output files. | ||
|
|
||
| Note that due to limitations in the FastText API, you cannot continue training |
There was a problem hiding this comment.
Please correct Docstring to be about varembed
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| class VarEmbed(Word2Vec): |
There was a problem hiding this comment.
Please subclass KeyedVectors
| # Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html | ||
|
|
||
| """ | ||
| Automated tests for checking transformation algorithms (the models package). |
There was a problem hiding this comment.
Oh! Had missed this. Thanks. Done.
| """Test ensembling of Morhpeme Embeddings""" | ||
| model = varembed.VarEmbed.load_varembed_format(vectors=varembed_model_vector_file, | ||
| morfessor_model=varembed_model_morfessor_file, use_morphemes=True) | ||
| self.model_sanity(model) |
There was a problem hiding this comment.
please test that syn0 is different compared to non-morpheme model.
| if use_morphemes: | ||
| try: | ||
| import morfessor | ||
| morfessor_model = morfessor.MorfessorIO().read_binary_model_file(morfessor_model) |
There was a problem hiding this comment.
could you raise an issue in varembed github as heads up that read_binary_model_file will be deprecated by morfessor?
There was a problem hiding this comment.
Yes, Sounds good. I've put up an issue rguthrie3/MorphologicalPriorsForWordEmbeddings#3 to notify them about the new release on morfessor as well.
…sor version in travis script
…support is only provided in python 2.7 and above. Also added more comments
c0ce65b to
ef17a12
Compare
ef17a12 to
bf57058
Compare
| word_embeddings = D['word_embeddings'] | ||
| morpho_embeddings = D['morpheme_embeddings'] | ||
| result.load_word_embeddings(word_embeddings, word_to_ix) | ||
| if use_morphemes: |
There was a problem hiding this comment.
just if morfessor_model is enough
There was a problem hiding this comment.
Yes, sounds good. Done.
| logger.info("Loaded matrix of %d size and %d dimensions", self.vocab_size, self.vector_size) | ||
|
|
||
|
|
||
| def ensemble_morpheme_embeddings(self, morfessor_model, morpho_embeddings, morpho_to_ix): |
There was a problem hiding this comment.
maybe add_morphemes_to_word_embeddings?
There was a problem hiding this comment.
Changed method name now.
4f5e359 to
3095620
Compare
|
Will there be a tutorial ipynb? |
|
Yes, I'll just add one. |
|
@tmylk I've now added a tutorial on varembed model as well. Please review once, and I feel we could merge it then. |
| Test only in Python 2.7 and above. Add Morphemes is not supported in earlier versions. | ||
| """ | ||
| model = varembed.VarEmbed.load_varembed_format(vectors=varembed_model_vector_file) | ||
| model_with_morphemes = varembed.VarEmbed.load_varembed_format(vectors=varembed_model_vector_file, |
There was a problem hiding this comment.
Code style: hanging indent please (not vertical indent).
|
|
||
| @unittest.skipUnless(sys.version_info < (2, 7), 'Test to check throwing exception in Python 2.6 and earlier') | ||
| def testAddMorphemesThrowsExceptionInPython26(self): | ||
| self.assertRaises(Exception, varembed.VarEmbed.load_varembed_format, vectors=varembed_model_vector_file, |
|
|
||
| This module allows ability to obtain word vectors for out-of-vocabulary words, for the Varembed model[2]. | ||
|
|
||
| The wrapped model can NOT be updated with new documents for online training -- use gensim's `Word2Vec` for that. |
There was a problem hiding this comment.
You mean that VarEmbed gensim wrapper doesn't support it? Also, someone might be confused that you are suggesting to load varembed and then train it as Word2Vec on new words which is incorrect
There was a problem hiding this comment.
Oh yes, you are right. It could be bit confusing earlier. Updated it now.
|
Thanks for adding the wrapper! It will be part of this week's release. Let's add a benchmark notebook and a blog in another PR. Will publicize when that is ready. |
|
Cool! Great! That's Awesome! :D |
This is an effort to integrate Varembed word-embedding Model into gensim.
To train varembed models, one can use the code put up by @rguthrie3 here.
Presently, the motive is to provide wrapper for loading varembed models with gensim and not training the word embeddings.
The first draft is based on Wordrank wrapper by @parulsethi
TODOs