[MRG] Keras wrapper for Word2Vec model in Gensim by chinmayapancholi13 · Pull Request #1248 · piskvorky/gensim

chinmayapancholi13 · 2017-03-30T12:49:51Z

This PR adds a Keras wrapper for Word2Vec Model in Gensim.


chinmayapancholi13 · 2017-03-30T13:15:00Z

@tmylk I have tried the wrapper on a smaller version of the 20NewsGroups task. The code is as follows (it is based on the code used here).

from __future__ import print_function

import os
import sys
import numpy as np
from gensim.sklearn_integration.keras_wrapper_gensim_word2vec import KerasWrapperWord2VecModel
from gensim.models import word2vec
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
from keras.layers import Dense, Input, Flatten
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.models import Model

BASE_DIR = ''
TEXT_DATA_DIR = BASE_DIR + './path/to/text/data/dir'
MAX_SEQUENCE_LENGTH = 1000
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2
BATCH_SIZE = 128

# prepare text samples and their labels

texts = []  # list of text samples
labels_index = {}  # dictionary mapping label name to numeric id
labels = []  # list of label ids

for name in sorted(os.listdir(TEXT_DATA_DIR)):
    path = os.path.join(TEXT_DATA_DIR, name)
    if os.path.isdir(path):
        label_id = len(labels_index)
        labels_index[name] = label_id
        for fname in sorted(os.listdir(path)):
            if fname.isdigit():
                fpath = os.path.join(path, fname)
                # Python 2's open() does not accept an `encoding` argument
                open_kwargs = {} if sys.version_info < (3,) else {'encoding': 'latin-1'}
                with open(fpath, **open_kwargs) as f:
                    t = f.read()
                i = t.find('\n\n')  # skip header
                if 0 < i:
                    t = t[i:]
                texts.append(t)
                labels.append(label_id)

# vectorize the text samples into a 2D integer tensor
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(labels))

# split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
num_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

x_train = data[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_val = data[-num_validation_samples:]
y_val = labels[-num_validation_samples:]

# train the embedding matrix
data1 = word2vec.LineSentence('./path/to/input/data')
Keras_w2v = KerasWrapperWord2VecModel(data1, min_count=1)
embedding_layer = Keras_w2v.get_embedding_layer()

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)

x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(35)(x)  # global max pooling
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(len(labels_index), activation='softmax')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=BATCH_SIZE)

tmylk · 2017-03-30T13:15:32Z

Please add unit tests and an ipynb

chinmayapancholi13 · 2017-03-30T13:23:07Z

@tmylk Sure.

chinmayapancholi13 · 2017-04-06T14:16:01Z

@tmylk I have added unit tests for a word similarity task (using cosine distance) as well as for a smaller version of the 20NewsGroups classification task. I have also created an IPython notebook for the wrapper that explains both examples.

tmylk · 2017-05-02T20:13:58Z

+Addresses of Atheist Organizations
+USA
+FREEDOM FROM RELIGION FOUNDATION
+Darwin fish bumper stickers and assorted other atheist paraphernalia are available from the Freedom From Religion Foundation in the US.


Please put data into test/test_data

@tmylk The data used in the unit tests is already present in the test/test_data folder. The data used in the IPython notebooks is present in the docs/notebooks/datasets folder. We are using the same data in both places, so to avoid unnecessary duplication, should I point the ipynb notebooks at the data in test/test_data as well (i.e. set the path accordingly in the ipynb)?

yes, test/test_data is a better location for data used in both

Got it. So, I'll change the path set in the ipynb for this functionality.

tmylk · 2017-05-02T21:35:08Z

+#
+# Copyright (C) 2011 Radim Rehurek <radimrehurek@seznam.cz>
+# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html
+"""Keras wrappers for gensim.


"Wrapper to allow gensim word2vec as input into Keras." is more clear. And in other docstrings and ipynbs too

@tmylk The file __init__.py is common for the entire folder gensim/keras_integration (which in turn would have the files for integration of the various models with Keras). So shouldn't the docstring here be more generic (like it already is)?

It can be more general, just pointing out the difference in meaning between "keras wrapper for gensim" vs "gensim wrapper for keras". Which one is it?
I think it is a "Wrapper to allow gensim models as input into Keras."

Okay. Understood it now. Thanks for pointing this out. So, I'll change "Keras wrappers for gensim" to "Wrappers to allow gensim models as input into Keras".

tmylk · 2017-05-02T21:41:03Z

Thanks for the feature. The change seems to be so small - just 5 lines in get_embedding_layer()

Instead of creating a new class, could you please just add this one method to KeyedVectors?

chinmayapancholi13 · 2017-05-03T22:46:26Z

@tmylk Sure. I'll add the function in KeyedVectors instead of creating a new class.
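
For readers following the thread, here is a rough sketch of what an embedding layer built from trained word vectors does conceptually: it is a lookup from integer word index to a row of the weight matrix. This is plain Python for illustration, not the gensim or Keras code from this PR; the vocabulary, the vectors, and the `embed` helper are all made-up assumptions.

```python
# Illustration only: a frozen embedding layer behaves like a row lookup.
# The vocabulary and the vectors are invented for this example.
vocab = ["graph", "trees", "minors"]
vectors = [
    [0.1, 0.3],  # row 0 -> "graph"
    [0.2, 0.1],  # row 1 -> "trees"
    [0.4, 0.5],  # row 2 -> "minors"
]

# Map each word to the index of its row in the weight matrix.
word_to_index = {word: i for i, word in enumerate(vocab)}


def embed(sequence_of_indices, weights):
    """Return the weight-matrix row for every index in the sequence."""
    return [weights[i] for i in sequence_of_indices]


indices = [word_to_index[w] for w in ["trees", "graph"]]
embedded = embed(indices, vectors)

# A layer wrapping this matrix would need these two sizes:
# the number of rows (vocabulary size) and the vector dimensionality.
input_dim, output_dim = len(vectors), len(vectors[0])
```

The only inputs such a layer needs are the weight matrix and a flag saying whether the rows stay fixed during training, which is consistent with the method being only a few lines long.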


tmylk · 2017-05-18T11:38:14Z

@chinmayapancholi13 this method belongs in the KeyedVectors class.
It would also be good to have an example integration with classification, as in https://github.com/stephenhky/PyShortTextCategorization/blob/db246e3ade2fcdea58953ff807d259464765a661/shorttext/classifiers/embed/nnlib/VarNNEmbedVecClassification.py

chinmayapancholi13 · 2017-05-18T18:26:19Z

@tmylk I have moved the function get_embedding_layer to keyedvectors.py. I am adding the classification example now.
Also, the tests are failing because PEP8 checks are being run on the test data. Is this expected?


chinmayapancholi13 · 2017-05-29T09:49:03Z

@tmylk Thanks a lot for your feedback. I have incorporated your suggestions as follows:

rename model_wv as wv: Done.
cosine of 2 vectors is not the probability of them occurring together: I have replaced the comment with "output is the cosine distance between the two words (as a similarity measure)".
Please fix "The Merge layer is deprecated and will be removed after 08/2017": I have replaced merge with the dot function (with param normalize set to True to take the cosine distance). The warning is no longer there.
Please compare the result of keras cos model with a simple wv[word_b].dot(wv[word_a]). it should be the same: The values are the same when we set normalize to False in the dot function, because in that case the output is only the dot product (and not the cosine distance) of the two vectors. Would we want this check to be added explicitly somewhere in the unit tests or the ipynb notebook? Or was it only meant to verify the behavior of the Keras function?
In the final cell the score of {'mathematics': 0.97023982, is very good. Please call it a good result!: I have updated the comment in the last cell.
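
The difference between the raw dot product and its normalized (cosine) form discussed above can be checked with a small standalone sketch. It uses plain Python and two made-up vectors rather than the actual model; the helper names `dot` and `cosine` are assumptions for illustration and are not the Keras functions.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Dot product divided by the product of the two vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Two made-up vectors, each of length 5.
vec_a = [3.0, 4.0]
vec_b = [4.0, 3.0]

raw = dot(vec_a, vec_b)     # no normalization
cos = cosine(vec_a, vec_b)  # normalized form

# Scaling both vectors to unit length first makes the two measures agree.
unit_a = [x / 5.0 for x in vec_a]
unit_b = [x / 5.0 for x in vec_b]
raw_on_units = dot(unit_a, unit_b)
```

So a comparison against `wv[word_b].dot(wv[word_a])` should only be expected to match the normalized output if the stored vectors are themselves unit length.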

tmylk · 2017-05-30T00:00:05Z

@@ -0,0 +1,8 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-


This file and folder are no longer needed

Right! Removing this folder right away.

tmylk · 2017-05-30T11:38:53Z

    'scikit-learn',
    'pyemd',
    'annoy',
+    'theano',


would only one of them be enough? tf is preferred

tmylk · 2017-06-02T12:03:05Z

Let's add a note about shorttext to ipynb.

And then LGTM

menshikh-iv · 2017-06-03T14:57:38Z

We need to fix the problems with Travis. It is strange: Keras is installed, but in the log I see unittest.case.SkipTest: Test requires Keras to be installed, which is not available.

chinmayapancholi13 · 2017-06-03T16:25:18Z

@menshikh-iv This is because Keras also needs Tensorflow to be installed (when Tensorflow is the default backend). We are not installing Tensorflow in .travis.yml because pip install tensorflow installs the CPU-only version by default. So if a user already has the GPU-enabled version of Tensorflow installed (via pip install tensorflow-gpu), it would get overwritten. This was also the reason for removing the automatic installation of TF during the installation of Keras (see keras-team/keras#5776 (comment)).

Thus, we (similar to Keras) expect users to install Tensorflow themselves.
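
The skip message quoted above typically comes from a guarded import in the test module. A minimal sketch of that pattern follows, using only the standard library. The module name `a_backend_that_is_not_installed`, the `optional_import` helper, and the test class are hypothetical stand-ins, not the actual gensim test code.

```python
import importlib
import unittest


def optional_import(module_name):
    """Return the imported module, or None if importing it fails."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# Hypothetical module name, chosen so that the import fails in this sketch.
backend = optional_import("a_backend_that_is_not_installed")


class TestNeedsBackend(unittest.TestCase):
    def setUp(self):
        # Skip, rather than fail, when the optional dependency is missing.
        if backend is None:
            raise unittest.SkipTest(
                "Test requires the backend to be installed, which is not available"
            )

    def test_backend_is_usable(self):
        self.assertTrue(hasattr(backend, "__name__"))


# Run the test case programmatically and collect the outcome.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestNeedsBackend)
result = unittest.TestResult()
suite.run(result)
```

With this pattern, a package that is installed but cannot be imported (for example because its own backend is missing) produces the same skip as a package that is absent altogether.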

menshikh-iv · 2017-06-03T16:36:46Z

@chinmayapancholi13 Thanks for the clarification, but what will we do about Travis?

chinmayapancholi13 · 2017-06-03T16:41:03Z

@menshikh-iv One solution is to include TF in the installation and add a note that users would have to re-install the TF version of their choice (i.e. in case it gets overwritten). For the tests to pass on Travis, I believe we must install TF somehow, so this seems like a workable solution to me.

menshikh-iv · 2017-06-04T08:01:49Z

@chinmayapancholi13 Sounds good to me, let's do it.

menshikh-iv · 2017-06-04T17:03:58Z

@chinmayapancholi13 Great 🥇

piskvorky

Thanks for the interesting feature and notebook!

There are some code style issues -- can you fix that? @menshikh-iv

piskvorky · 2017-06-05T02:51:30Z

+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Then, we call the wrapper and pass appropriate parameters."


What wrapper?

Here and below, the text refers to some "wrapper", but I see no wrapper.

piskvorky · 2017-06-05T02:54:21Z

+   "source": [
+    "word_a = 'graph'\n",
+    "word_b = 'trees'\n",
+    "output = keras_model.predict([np.asarray([model.wv.vocab[word_a].index]), np.asarray([model.wv.vocab[word_b].index])])  # output is the cosine distance between the two words (as a similarity measure)\n",


The comment would be better moved after the code, on a separate line, to improve readability.

piskvorky · 2017-06-05T02:55:28Z

+   "source": [
+    "# global variables\n",
+    "\n",
+    "nb_filters=1200  # number of filters\n",


PEP8: space around the assignment operator (x = 1).

Here and in many other places in the notebook.

piskvorky · 2017-06-05T02:56:45Z

+    "    category_col, descp_col = df.columns.values.tolist()\n",
+    "    shorttextdict = defaultdict(lambda : [])\n",
+    "    for category, descp in zip(df[category_col], df[descp_col]):\n",
+    "        if type(descp)==str:\n",


Will this work across Python 2 / Python 3?

Yes. It is working fine for both Python 2 and 3.
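
One caveat worth illustrating: in Python 2 a `unicode` value does not pass `type(x) == str`, so an `isinstance` check against a version-dependent tuple of string types is the more portable form. The sketch below is an assumption-laden illustration, not the notebook's code: the rows are made up, and the NaN entry stands in for a missing cell of the kind the type check presumably filters out.

```python
import sys

# Pick the string base type(s) for the running interpreter.
if sys.version_info[0] >= 3:
    string_types = (str,)
else:
    string_types = (basestring,)  # noqa: F821 (name exists on Python 2 only)

# Made-up (category, description) rows; NaN stands in for a missing cell.
rows = [
    ("mathematics", "linear algebra"),
    ("physics", float("nan")),
    ("mathematics", "topology"),
]

# Keep only the descriptions that are actual strings.
kept = [descp for _, descp in rows if isinstance(descp, string_types)]
```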

piskvorky · 2017-06-05T02:57:10Z

+    "    shorttextdict = defaultdict(lambda : [])\n",
+    "    for category, descp in zip(df[category_col], df[descp_col]):\n",
+    "        if type(descp)==str:\n",
+    "            shorttextdict[category] += [descp]\n",


Wouldn't append be simpler, faster and more readable?

Also, if you need to convert to plain dict anyway below, a plain setdefault(key, []).append(x) may be easier than defaultdict.
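
The two grouping idioms suggested here can be compared side by side. This is a small sketch with made-up (category, description) pairs; the variable names are illustrative and not taken from the notebook.

```python
from collections import defaultdict

# Made-up (category, description) pairs standing in for the CSV rows.
pairs = [
    ("mathematics", "linear algebra"),
    ("physics", "quantum optics"),
    ("mathematics", "topology"),
]

# Option 1: defaultdict(list) with append.
grouped_default = defaultdict(list)
for category, descp in pairs:
    grouped_default[category].append(descp)

# Option 2: a plain dict with setdefault, so no conversion is needed later.
grouped_plain = {}
for category, descp in pairs:
    grouped_plain.setdefault(category, []).append(descp)
```

Both produce the same grouping. The second form yields an ordinary dict directly, which avoids a later `dict(...)` conversion.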

piskvorky · 2017-06-05T02:58:04Z

+    "    \"\"\"\n",
+    "    df = pd.read_csv(filepath)\n",
+    "    category_col, descp_col = df.columns.values.tolist()\n",
+    "    shorttextdict = defaultdict(lambda : [])\n",


defaultdict(list)

piskvorky · 2017-06-05T02:59:06Z

+    "    Return an example data set, with three subjects and corresponding keywords.\n",
+    "    This is in the format of the training input.\n",
+    "    \"\"\"\n",
+    "    data_path = './datasets/keras_classifier_training_data.csv'\n",


os.path.join better (will work on Windows too).

Here and elsewhere.
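
As a sketch of the suggestion, the path from the snippet above can be assembled from its components so that the separator matches the host platform. The directory and file names are taken from the quoted diff; everything else is illustrative.

```python
import os

# Build the path from components instead of hard-coding '/' separators.
data_path = os.path.join('.', 'datasets', 'keras_classifier_training_data.csv')

# The components can be recovered in a platform-independent way as well.
directory, filename = os.path.split(data_path)
```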

piskvorky · 2017-06-05T03:00:00Z

+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The result above clearly suggests (~ 98% probability!) that the input `artificial intellegence` should belong to the category `mathematics`, which conforms very well with the expected output in this case.\n",


intellegence => intelligence

chinmayapancholi13 · 2017-06-05T17:10:10Z

@piskvorky Thanks a lot for your comprehensive feedback! :) I'd be happy to make these changes in a new PR. I'll also try to keep in mind these code-style issues in the future.

chinmayapancholi13 added 5 commits March 24, 2017 14:00

removed unnecessary keep_bocab_item import

01c3dde

removed duplicate warnings import

8797cd1

updated warning message for trim_rule

d09442c

Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim …

9dc3b3e

…into develop

added keras wrapper for word2vec

e14bc91

chinmayapancholi13 added 9 commits March 31, 2017 07:45

added tests for training and embedding layer

01eeb56

added newline at EOF

40f4b06

keras dependency

27842a9

added theano and tensorflow dependency

913b2cf

updated test file

ce7ef3d

added ipython notebook

ea5225e

added 20NewsGroups example in tests

cbb213c

added 20NewsGroups example in ipynb

cebb3ae

added param train_embeddings

1ebca5d

tmylk reviewed May 2, 2017


chinmayapancholi13 added 4 commits May 17, 2017 18:37

moved get_embedding_layer function to keyedvectors

b7c3e2d

updated ipynb for keras integration and added extra line after class …

123f5d7

…definition

PEP8 changes in code

d682180

more PEP8 changes

54a9d97

deleted data in and updated docstring in __init__.py

1e3c59e

added another classification example in ipynb

031400c

chinmayapancholi13 added 5 commits May 29, 2017 00:53

Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim …

042d674

…into keras_wrapper_word2vec

changes as per feedback from @tmylk

6ca5c52

updated assert statement for 'testEmbeddingLayerCosineSim'

860deca

update test_env in setup.py

45cad0a

catching ImportError properly

6d7a3e2

resolved merge conflict in setup.py

36210cc

tmylk reviewed May 30, 2017


removed 'keras_integration' folder

448e350

tmylk reviewed May 30, 2017


chinmayapancholi13 added 3 commits May 30, 2017 12:20

added note in 'get_embedding_layer' about mem usage

d06f2f4

removed Tensorflow and Theano from 'test_env' in setup.py

af94bfe

reduced number of epochs for 20NewsGroups example to reduce test time

405e556

chinmayapancholi13 changed the title ~~[WIP] Keras wrapper for Word2Vec model in Gensim~~ [MRG] Keras wrapper for Word2Vec model in Gensim Jun 1, 2017

removed theano, tensorflow install commands

d1251e5

added tensorflow installation in .travis.yml

bc198d7

menshikh-iv merged commit 7e74d15 into piskvorky:develop Jun 4, 2017

piskvorky reviewed Jun 5, 2017


menshikh-iv added the style checking label Jun 5, 2017

This was referenced Jun 6, 2017

Code-style changes in PR#1248 #1394

Merged

[WIP] Updating Gensim's Word2vec-Keras integration stephenhky/PyShortTextCategorization#7

Merged

menshikh-iv removed the style checking label Jun 14, 2017
