Removing Doc2Vec defaults so that it won't override Word2Vec defaults. fix #795#929
Conversation
|
@gojomo please review. |
|
Some thoughts:
|
|
@gojomo I don't get the error. There is no "dm_mean" variable in either word2vec.py or doc2vec.py. Am I missing something? |
|
There was a |
|
In #795 it was said that
but according to the CI test files, in this line we have to initialize params like dm_mean, window etc. So to implement this either we can change the test files or hard-code the default params from What are your views on this @gojomo ? EDIT: About the PR #927, no were not on the same sprint. |
|
The added description is helpful – but a descriptive title (as edited atop the PR page) is even better, as it then appears in the Github list/notification views.
We don't want to change the tests – holding them stable through any changes helps assure us that we're not totally breaking user code. (We're still altering the effective default, which may affect user results.) Hard-coding |
|
|
| sample=sample, seed=seed, workers=workers, min_alpha=min_alpha, | ||
| sg=(1+dm) % 2, hs=hs, negative=negative, cbow_mean=dm_mean, | ||
| size=size, window=window, min_count=min_count, | ||
| sample=sample, workers=workers, |
There was a problem hiding this comment.
If size, window, min_count, sample, and workers all use the same defaults as Word2Vec, there's no need to re-specify them in this method signature nor pass to the call of superclass __init__().
There was a problem hiding this comment.
For doc2vec the default params are size=300, window=8, min_count=5, sample=0, workers=1 and for wordvec size=100, window=5, min_count=5, sample=1e-3, workers=3. Removing these none common default params from doc2vec raise a Travis build error.
There was a problem hiding this comment.
The goal of issue #795 is to harmonize these defaults, unless there is a specific reason for the Doc2Vec defaults to differ. (Is there a reason to prefer size=300', 'window=8',sample=0', or 'workers=1as being more-sensible defaults in Doc2Vec?) And something likemin_count=5` already matches Word2Vec – so no need to redundantly specify it again.
There was a problem hiding this comment.
No that is definitely not the case, in the first commit that I submitted I removed these variables too but the Travis build seem to fail. Maybe we need to change the tests as well. What do you advice?
There was a problem hiding this comment.
It would depend specifically on what tests fail and why. (Looking at the failures at that time, it looks the problem was you retained a reference to size that was no longer valid: you'd taken our too little, not too much.) If at all possible, the tests shouldn't change: they help confirm that the refactoring didn't break anything else. But it might be the case that a test was dependent on the old defaults, and if that's the case, it could be appropriate to adjust the test as a last resort.
There was a problem hiding this comment.
For initialising the default params of doc2vec if I remove all the params passed on to super(Doc2Vec, self).init(....). In that case even if you pass some parameter to doc2vec. Suppose "workers", even in that case self.workers for both word2vec and doc2vec will be defaulted to 3(the default param of word2vec ). which implies that not passing the parameters at all to super(Doc2Vec, self).init(....) results in taking the default params of word2vec irrespective of whatever params you pass.
In the case when you pass all the params to super(Doc2Vec, self).init(size=size,alpha=alpha...) the default params of doc2vec will always be considered.
and defaults of word2vec will never be used.
I did some digging, but could not find a way (and possibly no way exist) to know whether the param is the default one or passed explicitly by the user.
I want to know your opinion about this.
There was a problem hiding this comment.
The use of **kwargs handles this. If you pass workers=5 to Doc2Vec, if it's not explicitly declared in the __init__() signature, it stays in the kwargs dict. That's then passed to the superclass __init__() – overriding the default of 3. So the desired effect is achieved: shared defaults exist in one place (Word2Vec), but calls to either can override with an explicit valiue.
(As noted in a prior comment, only dm_mean is especially tricky, needing a different approach, because it's a de facto synonym for cbow_mean. The goal is to have it so that if dm_mean is unspecified, the default Word2Vec cbow_mean value is operative, but if specified, it sets cbow_mean to the specified value. It's definitely also possible, and a solution might rely on the fact kwargs is a normal dict.)
| self.total_train_time = 0 | ||
| self.sorted_vocab = sorted_vocab | ||
| self.batch_words = batch_words | ||
| self.size = size |
There was a problem hiding this comment.
This new instance variable seems unnecessary; better to consult existing layer1_size.
9b1ac77 to
c144bf3
Compare
| self.cbow_mean = int(kwargs["dm_mean"]) | ||
| else: | ||
| self.cbow_mean = int(cbow_mean) | ||
|
|
There was a problem hiding this comment.
On the right track, but since 'dm_mean' is a Doc2Vec-specific parameter, its definition and handling would be best localized there (where it is also documented), rather than in Word2Vec.
And while the open-ended kwargs is helpful in Doc2Vec.__init__ for shuttling parameters to a related method, to have it here, in the terminal method, would mean that stray extra arguments offered in error (including misspelled parameters) would not generate any error. (Before the change, if neither Doc2Vec.__init__ nor Word2Vec.__init__ handle them, they would generate an error – which is helpful behavior.)
Try leaving dm_mean as a part of the Doc2Vec.__init__ signature, but with a default value of None. Within the __init__ body, check for a user-supplied value that should then be used as self.cbow_mean.
| wc = WikiCorpus(datapath(FILENAME)) | ||
|
|
||
| def test_get_texts_returns_generator_of_lists(self): | ||
|
|
There was a problem hiding this comment.
No real reason to have these whitespace changes in this unrelated patch.
There was a problem hiding this comment.
It was suggested by Lev to leave a blank line to pass the Travis build. I am not sure why.
There was a problem hiding this comment.
It should be removed. it was just a travis ping.
| """Test doc2vec training.""" | ||
| corpus = DocsLeeCorpus() | ||
| model = doc2vec.Doc2Vec(size=100, min_count=2, iter=20) | ||
| model = doc2vec.Doc2Vec(size=100, min_count=2, iter=20, window=8, sample=0.01, workers=1) |
There was a problem hiding this comment.
Working with a tiny corpus, workers=1 is a good idea (and I believe what most of the Word2Vec tests do). Were the other two parameter changes also necessary for the test to still pass? (It's good to have checked it still did, but if holding them same-as-before isn't strictly necessary, better to from-here-on-out test the new defaults.)
There was a problem hiding this comment.
Tests work fine without them so removed the other two params.
|
|
||
| # build vocab and train in one step; must be the same as above | ||
| model2 = doc2vec.Doc2Vec(corpus, size=100, min_count=2, iter=20) | ||
| model2 = doc2vec.Doc2Vec(corpus, size=100, min_count=2, iter=20, window=8, sample=0.01, workers=1) |
There was a problem hiding this comment.
As above, workers=1 is good but for the long run the other parameters should only be locked to historic values if definitely necessary.
|
|
||
| model = doc2vec.Doc2Vec(min_count=1) | ||
| size = 300 | ||
| model = doc2vec.Doc2Vec(min_count=1, size=size) |
There was a problem hiding this comment.
If these tests would succeed with the new (inherited) default size of 100, I think it'd be better to now run them that way – and then test the internal array sizes for either matching '100' or (perhaps simply to verify internal self-consistency) 'model.vector_size'. (Though, it's good that they've been run at least once with parameters matching and passing the old values – as a one-time check against regression in that config.)
There was a problem hiding this comment.
Seem to fail with the size = 100.
There was a problem hiding this comment.
There's both a 300 (count of text examples, length of syn0) that needs to be tested, and a 100 (dimensionality of individual vectors) - you've combined them both into size. These tests will pass at default (100) dimensionality, as long as you're not asserting the example-count is also 100.
|
|
||
| model = doc2vec.Doc2Vec(min_count=1) | ||
| size = 300 | ||
| model = doc2vec.Doc2Vec(size=size, min_count=1) |
There was a problem hiding this comment.
As above, though holding test params constant for at least one cycle was useful, if new-default value would also work as suitable test, best to have it be the value going forward.
There was a problem hiding this comment.
This too, seem to fail with the size = 100.
49163a0 to
9bf5cae
Compare
|
The build pass perfectly in my forked repo's Travis and VM. |
|
Yes, there's a test that's often hanging on Python 2.7 that's unrelated to your changes (TestWikiCorpus.test_first_element). I've made a separate issue for that (#971). (It may be that forcing re-tries will cause it to eventually succeed; it may be that something unique on the test machine is now causing it to reliably fail; in either case, when you see a failure there, it's not the fault of your changes.) Regarding the handling of |
718fbc6 to
0a353bb
Compare
|
Removed |
|
Ping @gojomo |
|
Looks good to me. Waiting for @gojomo review. |
|
Chnages all look good! The unrelated test timeout (wiki_corpus-related) means I'm not sure all Word2Vec/Doc2Vec tests have passed for the change, though. |
|
@markroxor Please rebase to resolve merge conflicts. |
more parameters excluded
213b4ec to
98d0dc3
Compare
|
ping @tmylk |
|
Thanks for the PR! |
Removing Doc2Vec defaults so that it won't override Word2Vec defaults.