Skip to content

Fix Pipeline#1213

Merged
tmylk merged 12 commits into
piskvorky:developfrom
kris-singh:PR#932
Mar 21, 2017
Merged

Fix Pipeline#1213
tmylk merged 12 commits into
piskvorky:developfrom
kris-singh:PR#932

Conversation

@kris-singh
Copy link
Copy Markdown
Contributor

@kris-singh kris-singh commented Mar 14, 2017

Solves PR #932

@kris-singh
Copy link
Copy Markdown
Contributor Author

kris-singh commented Mar 14, 2017

Do we have to add sklearn as dependencies. Ready for review

Copy link
Copy Markdown
Contributor

@tmylk tmylk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please provide a pipeline in tests and ipynb where output is lda is used in logistic regression

Comment thread gensim/test/test_sklearn_integration.py Outdated

def testPipline(self):
model = SklearnWrapperLdaModel(id2word=dictionary, num_topics=2, passes=100, minimum_probability=0, random_state=numpy.random.seed(0))
text_lda = Pipeline([('model', model)])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can a pipeline contain two things? From lda to logistic regression would be good. Also could you please add it to the tutorial.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do, you mean to say that we use lda as a feature extractor. And then use it to in the logistic regression. I thought of this and modified the transform function accordingly.

Comment thread docs/notebooks/sklearn_wrapper.ipynb Outdated
"outputs": [],
"source": []
"source": [
"def scorer(estimator, X,y=None):\n",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gridsearch returns exception in the ipynb. Is it possible to have it fixed?

@kris-singh
Copy link
Copy Markdown
Contributor Author

@tmylk Could you have a look at the Travis . I don't understand why is it failing.

@tmylk
Copy link
Copy Markdown
Contributor

tmylk commented Mar 17, 2017

Tests fixed by smart_open update

Comment thread gensim/test/test_sklearn_integration.py Outdated
self.assertTrue(isinstance(v, six.string_types))
self.assertTrue(isinstance(k, int))

def testPipline(self):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo in name of the function

Comment thread gensim/test/test_sklearn_integration.py Outdated
data = fetch_20newsgroups(subset='train',
categories=cats,
shuffle=True)
text_lda = Pipeline([('features', vec),('model', model)])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add logistic regression to the pipeline to analyse output of the lda

Copy link
Copy Markdown
Contributor Author

@kris-singh kris-singh Mar 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do that in the ipynb example as you had suggested. Also, I am not getting good accuracy using the features from lda transform around 52% which is meaningless for a binary classification task.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add it to the test.
accuracy is not important here. it is about being in compatible format

Comment thread gensim/test/test_sklearn_integration.py Outdated
vec = CountVectorizer(min_df=10, stop_words='english')
rand = numpy.random.mtrand.RandomState(1) # set seed for getting same result
cats = ['rec.sport.baseball', 'sci.crypt']
data = fetch_20newsgroups(subset='train',
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are smaller datasets in test_data folder. downloading a lot of data makes tests run too long

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmylk i was not able to find a dataset that the labels. If you know can you please tell me which one to use.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a tiny 100k subset of newsgroups would be ok.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the size of the text docs that you are adding?

@kris-singh
Copy link
Copy Markdown
Contributor Author

@tmylk All changes made. Ready for merge. Please let me know if further changes are required.

@kris-singh
Copy link
Copy Markdown
Contributor Author

@tmylk any other issues that will help with nmf that i could possibly look at.

Comment thread gensim/test/test_sklearn_integration.py Outdated
@@ -86,19 +92,15 @@ def testCSRMatrixConversion(self):

def testPipline(self):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo in test name

@tmylk
Copy link
Copy Markdown
Contributor

tmylk commented Mar 20, 2017

Thanks! The PR looks good.
For completeess, could you please remove the section inappropriately called "Using together with Scikit learn's Logistic Regression". That section doesn't use gensim at all so shouldn't be in the notebook. It's an omission by the original author.

Please put your new Pipeline section instead of it so users can find it faster.

@kris-singh
Copy link
Copy Markdown
Contributor Author

Changes made. Also the size of the test file is around 300 kb.

@tmylk tmylk merged commit 97cd64f into piskvorky:develop Mar 21, 2017
@tmylk
Copy link
Copy Markdown
Contributor

tmylk commented Mar 21, 2017

Thanks for the new feature!

@piskvorky
Copy link
Copy Markdown
Owner

piskvorky commented Apr 9, 2017

@tmylk this PR has multiple coding style and PEP8 issues. Please do not merge PRs that are not ready for merging.

"outputs": [],
"source": [
"from sklearn import linear_model"
"def scorer(estimator, X,y=None):\n",
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8: space after comma.

},
"outputs": [],
"source": [
"id2word=Dictionary(map(lambda x : x.split(),data.data))\n",
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

map is discouraged -- use comprehensions and generators.

Also, PEP8 -- space after comma, spaces around =.

"clf=linear_model.LogisticRegression(penalty='l1', C=0.1) #l1 penalty used\n",
"clf.fit(X,data.target)\n",
"print_features(clf,vocab)"
"model=SklearnWrapperLdaModel(num_topics=15,id2word=id2word,iterations=50, random_state=37)\n",
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8: spaces around assignment operator =. Other space/formatting/PEP8 issues further down this file, but this is the last comment.

X = matutils.Sparse2Corpus(X)

self.update(corpus=X)
self.update(corpus=X) No newline at end of file
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8: newline at the end of file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants