Skip to content

Fix train error of ConcatenatedDoc2Vec in the notebook of doc2vec-IMDB#1377

Merged
menshikh-iv merged 25 commits into
piskvorky:developfrom
robotcator:fix-notebook
Jul 7, 2017
Merged

Fix train error of ConcatenatedDoc2Vec in the notebook of doc2vec-IMDB#1377
menshikh-iv merged 25 commits into
piskvorky:developfrom
robotcator:fix-notebook

Conversation

@robotcator

Copy link
Copy Markdown
Contributor

No description provided.

Comment thread docs/notebooks/doc2vec-IMDB.ipynb Outdated
" if not isinstance(train_model, ConcatenatedDoc2Vec):\n",
" train_model.train(doc_list, total_examples=train_model.corpus_count, epochs=train_model.iter)\n",
" else:\n",
" train_model.train(doc_list)\n",

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A simpler and more robust fix would be to change the ConcatenatedDoc2Vec class, in test_doc2vec.py, to make its (no-op) train() match the new train() parameters-signature.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right. If the train() method is modified, the total_examples and epochs should be provided. But the ConcatenatedDoc2Vec class has no attribute 'corpus_count'.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The call doesn't have to use train_model.corpus_count from inside the model - it can just use len(doc_list). And since the outside loop is handling the multiple passes, the epochs argument should be 1.

@menshikh-iv

Copy link
Copy Markdown
Contributor

@robotcator Please merge develop to your branch (missing tensorflow in Travis config).

Comment thread gensim/test/test_doc2vec.py Outdated
def train(self, sentences, total_examples=None, total_words=None,
epochs=None, start_alpha=None, end_alpha=None,
word_count=0,
queue_factor=2, report_delay=1.0):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this is a no-op implementation that ignores any arguments, it can be specified even more compactly and generically, using Python's syntax for arbitrary positional/keyword arguments. EG:

def train(self, *ignored, **kwignored):

@menshikh-iv

Copy link
Copy Markdown
Contributor

@robotcator I open this notebook and get error Notebook validation failed: u'execution_count' is a required property: ..., can you please re-run all cells and commit it (this should help)

@robotcator

Copy link
Copy Markdown
Contributor Author

@menshikh-iv all the cells were re-ran and the error was fixed.

"*0.193200 : 1 passes : Doc2Vec(dbow,d100,n5,mc2,s0.001,t4)_inferred 35.3s 48.3s\n",
"*0.268640 : 1 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4) 48.6s 48.5s\n",
"*0.208000 : 1 passes : Doc2Vec(dm/m,d100,n5,w10,mc2,s0.001,t4)_inferred 48.6s 47.4s\n",
"*0.216160 : 1 passes : dbow+dmm 0.0s 168.9s\n",

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change in reported elapsed-time for these concatenated-model results, here and below, is so large it deserves a closer look. I'm no aware of any change, in this PR or prior related work, that could account for such changes – from <2.0 seconds to >170 seconds. It might be a spurious report, or something amiss.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the concatenated-model train method didn't do anything actually. I will check what' going on this.

@menshikh-iv

Copy link
Copy Markdown
Contributor

@robotcator What's a status of this PR?

@menshikh-iv

Copy link
Copy Markdown
Contributor

Ping @robotcator

@robotcator

Copy link
Copy Markdown
Contributor Author

@menshikh-iv sorry for late reply, I didn't receive last notification. This PR seems works fine and all the cell have been rerun. The warning of notebook is gone. It's able to be merged .

@menshikh-iv menshikh-iv merged commit 3e38e33 into piskvorky:develop Jul 7, 2017
@gojomo

gojomo commented Jul 7, 2017

Copy link
Copy Markdown
Collaborator

There's still the suspiciously-different reported-runtimes on some of the log lines.

@menshikh-iv

Copy link
Copy Markdown
Contributor

@gojomo As I see, we have only two code change here, and here. Another change from notebook re-run. I don't see any suspicious things in code changes, although elapsed time looks strange.

@gojomo

gojomo commented Jul 8, 2017

Copy link
Copy Markdown
Collaborator

In fact since the last time the notebook was run and its output committed, there have been many changes (outside this PR). So many things could cause the discrepancy; I just suggest the big runtime difference be understood alongside the updated notebook cells.

@piskvorky

Copy link
Copy Markdown
Owner

That's indeed very worrying. @menshikh-iv can you find which commit / PR caused this slowdown?

@robotcator

Copy link
Copy Markdown
Contributor Author

is there any possible that the performance of computer which results in the difference of output. Because I rerun that notebook twice and the second seems alright.

@menshikh-iv

Copy link
Copy Markdown
Contributor

@piskvorky accepted. I will search and fix if needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants