Fix Pipeline by kris-singh · Pull Request #1213 · piskvorky/gensim

kris-singh · 2017-03-14T02:19:38Z

Solves PR #932

kris-singh · 2017-03-14T03:08:19Z

Do we have to add sklearn as a dependency? Ready for review.

tmylk

Please provide a pipeline in the tests and the ipynb where the output of LDA is used in logistic regression.

tmylk · 2017-03-14T19:45:04Z


+    def testPipline(self):
+        model = SklearnWrapperLdaModel(id2word=dictionary, num_topics=2, passes=100, minimum_probability=0, random_state=numpy.random.seed(0))
+        text_lda = Pipeline([('model', model)])


Can a pipeline contain two things? From lda to logistic regression would be good. Also could you please add it to the tutorial.

Do you mean that we use LDA as a feature extractor and then feed its output into the logistic regression? I thought of this and modified the transform function accordingly.
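A minimal sketch of the two-step idea discussed here: a topic model used as a feature extractor, followed by logistic regression, inside one scikit-learn `Pipeline`. This is an illustration only. It uses scikit-learn's own `LatentDirichletAllocation` as a stand-in for the PR's `SklearnWrapperLdaModel`, whose full interface is not shown in this thread, and the toy corpus and labels are made up.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus and labels, invented purely for illustration.
docs = [
    "the pitcher threw the ball to the catcher",
    "the batter hit a home run in the ninth inning",
    "the team won the baseball game last night",
    "the cipher encrypts the message with a secret key",
    "public key cryptography protects the encrypted message",
    "the attacker tried to break the encryption key",
]
labels = [0, 0, 0, 1, 1, 1]

# Bag-of-words counts -> per-document topic weights -> classifier.
# Assumption: sklearn's LDA stands in for the gensim wrapper from the PR.
pipeline = Pipeline([
    ("features", CountVectorizer()),
    ("lda", LatentDirichletAllocation(n_components=2, random_state=0)),
    ("classifier", LogisticRegression()),
])
pipeline.fit(docs, labels)

# The intermediate LDA output is a dense (n_documents, n_topics) matrix,
# which is the format the downstream classifier consumes.
counts = pipeline.named_steps["features"].transform(docs)
topic_features = pipeline.named_steps["lda"].transform(counts)

predictions = pipeline.predict(docs)
```

The point of such a test is format compatibility: each step's `transform` output must be something the next step can accept as its input.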

tmylk · 2017-03-14T19:45:42Z

   "outputs": [],
-   "source": []
+   "source": [
+    "def scorer(estimator, X,y=None):\n",


This gridsearch returns exception in the ipynb. Is it possible to have it fixed?

kris-singh · 2017-03-16T20:29:44Z

@tmylk Could you have a look at the Travis build? I don't understand why it is failing.

tmylk · 2017-03-17T00:38:43Z

Tests fixed by smart_open update

tmylk · 2017-03-17T00:57:06Z

            self.assertTrue(isinstance(v, six.string_types))
            self.assertTrue(isinstance(k, int))

+    def testPipline(self):


typo in name of the function

tmylk · 2017-03-17T00:57:35Z

+        data = fetch_20newsgroups(subset='train',
+                                  categories=cats,
+                                  shuffle=True)
+        text_lda = Pipeline([('features', vec),('model', model)])


please add logistic regression to the pipeline to analyse output of the lda

I do that in the ipynb example, as you suggested. Also, I am not getting good accuracy using the features from the LDA transform: it is around 52%, which is meaningless for a binary classification task.

Please add it to the test.
Accuracy is not important here; it is about the output being in a compatible format.

tmylk · 2017-03-17T00:58:00Z

+        vec = CountVectorizer(min_df=10, stop_words='english')
+        rand = numpy.random.mtrand.RandomState(1) # set seed for getting same result
+        cats = ['rec.sport.baseball', 'sci.crypt']
+        data = fetch_20newsgroups(subset='train',


There are smaller datasets in the test_data folder. Downloading a lot of data makes the tests run too long.

@tmylk I was not able to find a dataset that has labels. If you know of one, can you please tell me which one to use?

a tiny 100k subset of newsgroups would be ok.

what is the size of the text docs that you are adding?

kris-singh · 2017-03-19T07:00:00Z

@tmylk All changes made. Ready for merge. Please let me know if further changes are required.

kris-singh · 2017-03-20T02:13:46Z

@tmylk Are there any other issues that would help with NMF that I could look at?

tmylk · 2017-03-20T18:49:49Z

@@ -86,19 +92,15 @@ def testCSRMatrixConversion(self):

    def testPipline(self):


typo in test name

tmylk · 2017-03-20T18:57:26Z

Thanks! The PR looks good.
For completeness, could you please remove the section inappropriately called "Using together with Scikit learn's Logistic Regression"? That section doesn't use gensim at all, so it shouldn't be in the notebook. It's an oversight by the original author.

Please put your new Pipeline section in its place so users can find it faster.

kris-singh · 2017-03-21T13:33:41Z

Changes made. Also the size of the test file is around 300 kb.

tmylk · 2017-03-21T18:50:45Z

Thanks for the new feature!

piskvorky · 2017-04-09T07:05:13Z

@tmylk this PR has multiple coding style and PEP8 issues. Please do not merge PRs that are not ready for merging.

piskvorky · 2017-04-09T07:13:09Z

   "outputs": [],
   "source": [
-    "from sklearn import linear_model"
+    "def scorer(estimator, X,y=None):\n",


PEP8: space after comma.

piskvorky · 2017-04-09T07:14:23Z

+   },
+   "outputs": [],
+   "source": [
+    "id2word=Dictionary(map(lambda x : x.split(),data.data))\n",


map is discouraged -- use comprehensions and generators.

Also, PEP8 -- space after comma, spaces around =.
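A small sketch of what this comment asks for, assuming `data.data` is a list of raw document strings (the `documents` list below is a hypothetical stand-in). It shows the `map`/`lambda` form next to the comprehension and generator forms, with PEP8 spacing.

```python
# Hypothetical stand-in for `data.data` from the notebook: a list of raw strings.
documents = [
    "human machine interface",
    "graph of trees",
    "trees and graph minors",
]

# Style used in the notebook: map with a lambda.
tokens_via_map = list(map(lambda x: x.split(), documents))

# Equivalent list comprehension: same result, easier to read.
tokens_via_comprehension = [doc.split() for doc in documents]

# Generator expression: yields one token list at a time, so the whole
# tokenized corpus never has to sit in memory at once.
tokens_lazy = (doc.split() for doc in documents)

# In the notebook this would become something like the line below
# (not executed here, to keep the sketch free of the gensim dependency):
#     id2word = Dictionary(doc.split() for doc in data.data)
```

The generator form suits `Dictionary` because the vocabulary is built in a single pass over the documents.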

piskvorky · 2017-04-09T07:15:09Z

-    "clf=linear_model.LogisticRegression(penalty='l1', C=0.1) #l1 penalty used\n",
-    "clf.fit(X,data.target)\n",
-    "print_features(clf,vocab)"
+    "model=SklearnWrapperLdaModel(num_topics=15,id2word=id2word,iterations=50, random_state=37)\n",


PEP8: spaces around assignment operator =. Other space/formatting/PEP8 issues further down this file, but this is the last comment.

piskvorky · 2017-04-09T07:16:12Z

            X = matutils.Sparse2Corpus(X)

-        self.update(corpus=X)
+        self.update(corpus=X)


PEP8: newline at the end of file.

Fix Pipeline

8098e56

sklearn dependency

40ffca0

tmylk suggested changes Mar 14, 2017


cs15mtech11007@iith.ac.in added 5 commits March 16, 2017 04:48

Changes Added

36b8a81

Changes Made

7391fcc

minor fix

efe96e6

.travis

a133b49

try

970df21

Fix for >3.5

768e39b

tmylk reviewed Mar 17, 2017


cs15mtech11007@iith.ac.in added 3 commits March 19, 2017 10:46

Changes Made

3bccc20

Compressed Data

b709026

add data

6d15ae7

tmylk reviewed Mar 20, 2017


Typo Fixed

de600ba

tmylk merged commit 97cd64f into piskvorky:develop Mar 21, 2017

piskvorky reviewed Apr 9, 2017

