Skip to content

Fix the comment of Translation Matrix#1594

Merged
menshikh-iv merged 8 commits into
piskvorky:developfrom
robotcator:mydevelop
Sep 25, 2017
Merged

Fix the comment of Translation Matrix#1594
menshikh-iv merged 8 commits into
piskvorky:developfrom
robotcator:mydevelop

Conversation

@robotcator
Copy link
Copy Markdown
Contributor

No description provided.

@robotcator robotcator changed the title Fix the comment of Tr Fix the comment of Translation Matrix Sep 19, 2017
Comment thread gensim/models/translation_matrix.py Outdated
Args:
`word_pair` (list): a list pair of words
`word_pairs` (list): a list pair of words
`source_space` (Space object): source language space
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

train method use only word_pairs, what is source/target space here?

Comment thread gensim/models/translation_matrix.py Outdated
self.source_space = Space.build(self.source_lang_vec, set(self.source_word))
self.target_space = Space.build(self.target_lang_vec, set(self.target_word))
self.source_word, self.target_word = zip(*word_pairs)
if self.translation_matrix is None:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if I called train twice, in the second time I don't fit model.
Please remove this if


with utils.smart_open(self.train_file, "r") as f:
self.word_pair = [tuple(utils.to_unicode(line).strip().split()) for line in f]
self.word_pairs = [("one", "uno"), ("two", "due"), ("three", "tre"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use hanging indents

Comment thread gensim/test/test_translation_matrix.py Outdated
("grape", "acino"), ("banana", "banana"), ("mango", "mango")
]

self.test_word_pairs = [("ten", "dieci"), ("dog", "cane"), ("cat", "gatto")]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove ("dog", "cane") from self.word_pairs

Comment thread gensim/models/translation_matrix.py Outdated

def normalize(self):
""" normalized the word vector's matrix """
""" Normalized the word vector's matrix """
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'Normalize…' (imperative rather than past-tense)

Comment thread gensim/models/translation_matrix.py Outdated
def apply_transmat(self, words_space):
"""
mapping the source word vector to the target word vector using translation matrix
Mapping the source word vector to the target word vector using translation matrix
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'Map…' (imperative rather than '-ing' form)

@gojomo
Copy link
Copy Markdown
Collaborator

gojomo commented Sep 21, 2017

The handling of word_pairs in __init__() and train() now makes sense, thanks. The comments have been improved but still may benefit from a deep review for clarity/wording.

Though I know I requested the Doc2Vec-related example, in its current form the motivations/benefits are muddled. Really it shouldn't require a separate helper class (BackMappingTranslationMatrix), and the notebook section ("Tranlation Matrix Revisit") is hard-to-follow, and includes a bunch of improper practices. (For example: using an imbalanced set of docs for the mapping 'overlap'; using the slow-and-iffy dm_concat mode; calling train() multiple times with a sawtooth alpha progression, etc.)

The word-translation example can presumably be evaluated based on real datasets in the original context that motivated the approach, while the doc-vec example will need more novel design/evaluation – so I'd recommend splitting them into separate notebooks.

@robotcator
Copy link
Copy Markdown
Contributor Author

robotcator commented Sep 21, 2017

Thanks. You do remind me the imbalanced set problem in the example. And the code for training document vector are borrowed from the doc2vec-imdb.ipynb and I will re-train the document vector.
As for the imbalanced data, how to sample documents to 'overlap' according to the sentiment or whether it is in the train or test set. (according to the sentiment to sample is more logical)

For the BackMappingTranslationMatrix class, I didn't find a good way to integrate this function into my TranslationMatrix class, so I separate this into two class. Because the word2vec and doc2vec has different method to access the vector.

for word2vec, use model[word] to get the word vector.
for doc2vec, use model.doc2vec['doc_tag'] to get the document vector.

If BackMappingTranslationMatrix was integrated into TranslationMatrix, I would handle them separately according to the type (ininstance method), is it appropriate?

I didn't catch that The word-translation example can be evaluated based on real datasets in the original context , can you please explain in more detail?

@gojomo
Copy link
Copy Markdown
Collaborator

gojomo commented Sep 21, 2017

I meant there are published papers about using word-vector transformations for language-translation - the original Google paper, the Dinu paper – so there are specific datasets & procedures to mimic – and similar results would indicate everything is working. The Doc2Vec use is novel so requires more experimentation/thought.

@robotcator
Copy link
Copy Markdown
Contributor Author

robotcator commented Sep 22, 2017

The word pairs used in this experiment are extracted from the OPUS(http://opus.lingfil.uu.se/). The same as the Ninu's paper. I plot the vis to show the linear relationship between two language vector space and use the word translation to show this transformation works. More re-produced experiments from the Mikolov's and Ninu's paper would be fine to support this transformation. But I still can not find any experiment for language translation(Do it mean sentences translation ) from the two paper I mentioned here? Can you remind me If I miss something?

I added "unstable/experimental" warning tag in notebook explicitly for doc2vec transformation part as Ivan suggested.

I also found a paper (OFFLINE BILINGUAL WORD VECTORS,ORTHOGONALTRANSFORMATIONS AND THE INVERTED SOFTMAX) is related to this experiment. But I‘m just having a look and need dig deeper.

@menshikh-iv
Copy link
Copy Markdown
Contributor

Thank you @robotcator for fast fixes.
Thank you @gojomo for a review:+1:

@menshikh-iv
Copy link
Copy Markdown
Contributor

Need to fix some typos/pep8 issues in a notebook, but I can't wait for more, it's release time.

@menshikh-iv menshikh-iv merged commit 33a3ef2 into piskvorky:develop Sep 25, 2017
@robotcator robotcator deleted the mydevelop branch September 25, 2017 12:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants