Fix Pipeline#1213
Conversation
|
Do we have to add sklearn as dependencies. Ready for review |
tmylk
left a comment
There was a problem hiding this comment.
Please provide a pipeline in tests and ipynb where output is lda is used in logistic regression
|
|
||
| def testPipline(self): | ||
| model = SklearnWrapperLdaModel(id2word=dictionary, num_topics=2, passes=100, minimum_probability=0, random_state=numpy.random.seed(0)) | ||
| text_lda = Pipeline([('model', model)]) |
There was a problem hiding this comment.
Can a pipeline contain two things? From lda to logistic regression would be good. Also could you please add it to the tutorial.
There was a problem hiding this comment.
Do, you mean to say that we use lda as a feature extractor. And then use it to in the logistic regression. I thought of this and modified the transform function accordingly.
| "outputs": [], | ||
| "source": [] | ||
| "source": [ | ||
| "def scorer(estimator, X,y=None):\n", |
There was a problem hiding this comment.
This gridsearch returns exception in the ipynb. Is it possible to have it fixed?
|
@tmylk Could you have a look at the Travis . I don't understand why is it failing. |
|
Tests fixed by smart_open update |
| self.assertTrue(isinstance(v, six.string_types)) | ||
| self.assertTrue(isinstance(k, int)) | ||
|
|
||
| def testPipline(self): |
There was a problem hiding this comment.
typo in name of the function
| data = fetch_20newsgroups(subset='train', | ||
| categories=cats, | ||
| shuffle=True) | ||
| text_lda = Pipeline([('features', vec),('model', model)]) |
There was a problem hiding this comment.
please add logistic regression to the pipeline to analyse output of the lda
There was a problem hiding this comment.
I do that in the ipynb example as you had suggested. Also, I am not getting good accuracy using the features from lda transform around 52% which is meaningless for a binary classification task.
There was a problem hiding this comment.
please add it to the test.
accuracy is not important here. it is about being in compatible format
| vec = CountVectorizer(min_df=10, stop_words='english') | ||
| rand = numpy.random.mtrand.RandomState(1) # set seed for getting same result | ||
| cats = ['rec.sport.baseball', 'sci.crypt'] | ||
| data = fetch_20newsgroups(subset='train', |
There was a problem hiding this comment.
there are smaller datasets in test_data folder. downloading a lot of data makes tests run too long
There was a problem hiding this comment.
@tmylk i was not able to find a dataset that the labels. If you know can you please tell me which one to use.
There was a problem hiding this comment.
a tiny 100k subset of newsgroups would be ok.
There was a problem hiding this comment.
what is the size of the text docs that you are adding?
|
@tmylk All changes made. Ready for merge. Please let me know if further changes are required. |
|
@tmylk any other issues that will help with nmf that i could possibly look at. |
| @@ -86,19 +92,15 @@ def testCSRMatrixConversion(self): | |||
|
|
|||
| def testPipline(self): | |||
|
Thanks! The PR looks good. Please put your new Pipeline section instead of it so users can find it faster. |
|
Changes made. Also the size of the test file is around 300 kb. |
|
Thanks for the new feature! |
|
@tmylk this PR has multiple coding style and PEP8 issues. Please do not merge PRs that are not ready for merging. |
| "outputs": [], | ||
| "source": [ | ||
| "from sklearn import linear_model" | ||
| "def scorer(estimator, X,y=None):\n", |
| }, | ||
| "outputs": [], | ||
| "source": [ | ||
| "id2word=Dictionary(map(lambda x : x.split(),data.data))\n", |
There was a problem hiding this comment.
map is discouraged -- use comprehensions and generators.
Also, PEP8 -- space after comma, spaces around =.
| "clf=linear_model.LogisticRegression(penalty='l1', C=0.1) #l1 penalty used\n", | ||
| "clf.fit(X,data.target)\n", | ||
| "print_features(clf,vocab)" | ||
| "model=SklearnWrapperLdaModel(num_topics=15,id2word=id2word,iterations=50, random_state=37)\n", |
There was a problem hiding this comment.
PEP8: spaces around assignment operator =. Other space/formatting/PEP8 issues further down this file, but this is the last comment.
| X = matutils.Sparse2Corpus(X) | ||
|
|
||
| self.update(corpus=X) | ||
| self.update(corpus=X) No newline at end of file |
There was a problem hiding this comment.
PEP8: newline at the end of file.
Solves PR #932