Skip to content

Wrapper for Varembed Models#1067

Merged
tmylk merged 23 commits into
piskvorky:developfrom
anmolgulati:varembed-worker
Feb 6, 2017
Merged

Wrapper for Varembed Models#1067
tmylk merged 23 commits into
piskvorky:developfrom
anmolgulati:varembed-worker

Conversation

@anmolgulati
Copy link
Copy Markdown
Contributor

@anmolgulati anmolgulati commented Dec 30, 2016

This is an effort to integrate Varembed word-embedding Model into gensim.
To train varembed models, one can use the code put up by @rguthrie3 here.
Presently, the motive is to provide wrapper for loading varembed models with gensim and not training the word embeddings.

The first draft is based on Wordrank wrapper by @parulsethi

TODOs

  • Add tests for varembed wrapper.
  • Add model files for testing loading of wrapper.
  • Add tutorial in documentation.
  • Add RST files for Sphinx APIREF autogeneration.

@anmolgulati
Copy link
Copy Markdown
Contributor Author

I've added wrapper to load varembed model into gensim and allowing keyedvectors functionalities.
As the wrapper includes dependency over morfessor package. I've factored that out into a different method and ensemble the morpheme embeddings iff the user asks for ensembling the morpheme embeddings and morfessor package is present on the system.
@tmylk @piskvorky Please review.

@anmolgulati
Copy link
Copy Markdown
Contributor Author

anmolgulati commented Jan 29, 2017

There seems to be a small issue specifically in the morfessor package. I've submitted a PR aalto-speech/morfessor#6 for the same. We'll probably have to wait to get it merged or find some other fix to get the code running in Python 2.6. Any suggestions?

@anmolgulati
Copy link
Copy Markdown
Contributor Author

anmolgulati commented Jan 29, 2017

@tmylk Apart from the issue in Python 2.6, which I've already discussed. The code right now fails in other versions as well, which seems to me a problem in circular dependency in imports in morfessor. Though the tests run fine on my personal machine, it fails on Travis, which is something I don't understand. Any ideas, what's going wrong?

Copy link
Copy Markdown
Contributor

@tmylk tmylk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More tests, KeyedVectors and other small improvements needed.

Comment thread gensim/models/wrappers/varembed.py Outdated
"""
Load the input-hidden weight matrix from the fast text output files.

Note that due to limitations in the FastText API, you cannot continue training
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please correct Docstring to be about varembed

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread gensim/models/wrappers/varembed.py Outdated
logger = logging.getLogger(__name__)


class VarEmbed(Word2Vec):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please subclass KeyedVectors

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread gensim/test/test_varembed_wrapper.py Outdated
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

"""
Automated tests for checking transformation algorithms (the models package).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change to VarEmbed

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! Had missed this. Thanks. Done.

Comment thread gensim/test/test_varembed_wrapper.py Outdated
"""Test ensembling of Morhpeme Embeddings"""
model = varembed.VarEmbed.load_varembed_format(vectors=varembed_model_vector_file,
morfessor_model=varembed_model_morfessor_file, use_morphemes=True)
self.model_sanity(model)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please test that syn0 is different compared to non-morpheme model.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, added now.

Comment thread gensim/models/wrappers/varembed.py Outdated
if use_morphemes:
try:
import morfessor
morfessor_model = morfessor.MorfessorIO().read_binary_model_file(morfessor_model)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you raise an issue in varembed github as heads up that read_binary_model_file will be deprecated by morfessor?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Sounds good. I've put up an issue rguthrie3/MorphologicalPriorsForWordEmbeddings#3 to notify them about the new release on morfessor as well.

…support is only provided in python 2.7 and above. Also added more comments
Comment thread gensim/models/wrappers/varembed.py Outdated
word_embeddings = D['word_embeddings']
morpho_embeddings = D['morpheme_embeddings']
result.load_word_embeddings(word_embeddings, word_to_ix)
if use_morphemes:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just if morfessor_model is enough

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sounds good. Done.

Comment thread gensim/models/wrappers/varembed.py Outdated
logger.info("Loaded matrix of %d size and %d dimensions", self.vocab_size, self.vector_size)


def ensemble_morpheme_embeddings(self, morfessor_model, morpho_embeddings, morpho_to_ix):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add_morphemes_to_word_embeddings?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed method name now.

@anmolgulati anmolgulati force-pushed the varembed-worker branch 2 times, most recently from 4f5e359 to 3095620 Compare February 2, 2017 14:35
@tmylk
Copy link
Copy Markdown
Contributor

tmylk commented Feb 2, 2017

Will there be a tutorial ipynb?

@anmolgulati
Copy link
Copy Markdown
Contributor Author

Yes, I'll just add one.

@anmolgulati
Copy link
Copy Markdown
Contributor Author

anmolgulati commented Feb 2, 2017

@tmylk I've now added a tutorial on varembed model as well. Please review once, and I feel we could merge it then.

@anmolgulati anmolgulati changed the title [WIP] Wrapper for Varembed Models Wrapper for Varembed Models Feb 2, 2017
Comment thread gensim/test/test_varembed_wrapper.py Outdated
Test only in Python 2.7 and above. Add Morphemes is not supported in earlier versions.
"""
model = varembed.VarEmbed.load_varembed_format(vectors=varembed_model_vector_file)
model_with_morphemes = varembed.VarEmbed.load_varembed_format(vectors=varembed_model_vector_file,
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code style: hanging indent please (not vertical indent).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment thread gensim/test/test_varembed_wrapper.py Outdated

@unittest.skipUnless(sys.version_info < (2, 7), 'Test to check throwing exception in Python 2.6 and earlier')
def testAddMorphemesThrowsExceptionInPython26(self):
self.assertRaises(Exception, varembed.VarEmbed.load_varembed_format, vectors=varembed_model_vector_file,
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hanging indent.

Copy link
Copy Markdown
Contributor Author

@anmolgulati anmolgulati Feb 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed now.

Comment thread gensim/models/wrappers/varembed.py Outdated

This module allows ability to obtain word vectors for out-of-vocabulary words, for the Varembed model[2].

The wrapped model can NOT be updated with new documents for online training -- use gensim's `Word2Vec` for that.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean that VarEmbed gensim wrapper doesn't support it? Also, someone might be confused that you are suggesting to load varembed and then train it as Word2Vec on new words which is incorrect

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, you are right. It could be bit confusing earlier. Updated it now.

@tmylk tmylk merged commit e1c3a0b into piskvorky:develop Feb 6, 2017
@tmylk
Copy link
Copy Markdown
Contributor

tmylk commented Feb 6, 2017

Thanks for adding the wrapper! It will be part of this week's release.

Let's add a benchmark notebook and a blog in another PR. Will publicize when that is ready.

@anmolgulati
Copy link
Copy Markdown
Contributor Author

Cool! Great! That's Awesome! :D
Yes sure, will open up a new PR for the benchmark and blog documentation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants