[MRG] Keras wrapper for Word2Vec model in Gensim#1248
Conversation
|
@tmylk I have tried to use the wrapper for a smaller version of the 20NewsGroup task. The code used is as follows. (It is based on the code used here) from __future__ import print_function
import os
import sys
import numpy as np
from gensim.sklearn_integration.keras_wrapper_gensim_word2vec import KerasWrapperWord2VecModel
from gensim.models import word2vec
from keras.engine import Input
from keras.layers import merge
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
from keras.layers import Dense, Input, Flatten
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.models import Model
BASE_DIR = ''
TEXT_DATA_DIR = BASE_DIR + './path/to/text/data/dir'
MAX_SEQUENCE_LENGTH = 1000
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2
BATCH_SIZE = 128
# prepare text samples and their labels
texts = [] # list of text samples
labels_index = {} # dictionary mapping label name to numeric id
labels = [] # list of label ids
for name in sorted(os.listdir(TEXT_DATA_DIR)):
path = os.path.join(TEXT_DATA_DIR, name)
if os.path.isdir(path):
label_id = len(labels_index)
labels_index[name] = label_id
for fname in sorted(os.listdir(path)):
if fname.isdigit():
fpath = os.path.join(path, fname)
if sys.version_info < (3,):
f = open(fpath)
else:
f = open(fpath, encoding='latin-1')
t = f.read()
i = t.find('\n\n') # skip header
if 0 < i:
t = t[i:]
texts.append(t)
f.close()
labels.append(label_id)
# vectorize the text samples into a 2D integer tensor
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(labels))
# split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
num_validation_samples = int(VALIDATION_SPLIT * data.shape[0])
x_train = data[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_val = data[-num_validation_samples:]
y_val = labels[-num_validation_samples:]
# train the embedding matrix
data1 = word2vec.LineSentence('./path/to/input/data')
Keras_w2v = KerasWrapperWord2VecModel(data1, min_count=1)
embedding_layer = Keras_w2v.get_embedding_layer()
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(35)(x) # global max pooling
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(len(labels_index), activation='softmax')(x)
model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
optimizer='rmsprop',
metrics=['acc'])
model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=BATCH_SIZE) |
|
Please add unit tests and an ipynb |
|
@tmylk Sure. |
|
@tmylk I have added unit tests for word similarity task (using cosine distance) as well as a smaller version of the 20NewsGroups classification task. I have also created an IPython notebook for the wrapper explaining both these examples. |
| Addresses of Atheist Organizations | ||
| USA | ||
| FREEDOM FROM RELIGION FOUNDATION | ||
| Darwin fish bumper stickers and assorted other atheist paraphernalia are available from the Freedom From Religion Foundation in the US. |
There was a problem hiding this comment.
Please put data into test/test_data
There was a problem hiding this comment.
@tmylk The data used in the unit tests is present in test/test_data folder already. The data used in the IPython notebooks is present in docs/notebooks/datasets folder. We are using the same data at both places so to avoid unnecessary duplication, should I use the data in test/test_data (i.e. set the path accordingly in the ipynb) in the ipynb notebooks as well?
There was a problem hiding this comment.
yes, test/test_data is a better location for data used in both
There was a problem hiding this comment.
Got it. So, I'll change the path set in the ipynb for this functionality.
| # | ||
| # Copyright (C) 2011 Radim Rehurek <radimrehurek@seznam.cz> | ||
| # Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html | ||
| """Keras wrappers for gensim. |
There was a problem hiding this comment.
"Wrapper to allow gensim word2vec as input into Keras." is more clear. And in other docstrings and ipynbs too
There was a problem hiding this comment.
@tmylk The file __init__.py is common for the entire folder gensim/keras_integration (which in turn would have the files for integration of the various models with Keras). So shouldn't the docstring here be more generic (like it already is)?
There was a problem hiding this comment.
It can be more general, just pointing out the difference in meaning between "keras wrapper for gensim" vs "gensim wrapper for keras". Which one is it?
I think it is a "Wrapper to allow gensim models as input into Keras."
There was a problem hiding this comment.
Okay. Understood it now. Thanks for pointing this out. So, I'll change "Keras wrappers for gensim" to "Wrappers to allow gensim models as input into Keras".
|
Thanks for the feature. The change seems to be so small - just 5 lines in Instead of creating a new class, could you please just add this one method to |
|
@tmylk Sure. I'll add the function in |
|
@chinmayapancholi13 this is more appropriate to have in KeyedVectors class. |
|
@tmylk I have moved the function |
|
@tmylk Thanks a lot for your feedback. I have incorporated your suggestions as follows :
|
| @@ -0,0 +1,8 @@ | |||
| #!/usr/bin/env python | |||
| # -*- coding: utf-8 -*- | |||
There was a problem hiding this comment.
This file and folder are no longer needed
There was a problem hiding this comment.
Right! Removing this folder right away.
| 'scikit-learn', | ||
| 'pyemd', | ||
| 'annoy', | ||
| 'theano', |
There was a problem hiding this comment.
would only one of them be enough? tf is preferred
|
Let's add a note about shorttext to ipynb. And then LGTM |
|
Need to fix problems with travis (so strange, because keras is installed, but in log I see |
|
@menshikh-iv This is because Keras also needs Tensorflow to be installed (when Tensorflow is the default backend). And we are not installing Tensorflow in Thus, we(similar to Keras) expect the users to install Tensorflow by themselves. |
|
@chinmayapancholi13 Thank for your clarification, but what will we do with Travis? |
|
@menshikh-iv One solution can be to include TF in the installation and add a note that users would have to re-install the TF of their choice again (i.e. in case it gets overwritten). For the tests to get passed in travis, I believe we must install TF somehow so this can be a solution in my opinion. |
|
@chinmayapancholi13 Sounds good for me, let's do it. |
|
@chinmayapancholi13 Great 🥇 |
piskvorky
left a comment
There was a problem hiding this comment.
Thanks for the interesting feature and notebook!
There are some code style issues -- can you fix that? @menshikh-iv
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "Then, we call the wrapper and pass appropriate parameters." |
There was a problem hiding this comment.
What wrapper?
Here and below, the text refers to some "wrapper", but I see no wrapper.
| "source": [ | ||
| "word_a = 'graph'\n", | ||
| "word_b = 'trees'\n", | ||
| "output = keras_model.predict([np.asarray([model.wv.vocab[word_a].index]), np.asarray([model.wv.vocab[word_b].index])]) # output is the cosine distance between the two words (as a similarity measure)\n", |
There was a problem hiding this comment.
The comment would be better moved after the code, on a separate line, to improve readability.
| "source": [ | ||
| "# global variables\n", | ||
| "\n", | ||
| "nb_filters=1200 # number of filters\n", |
There was a problem hiding this comment.
PEP8: space around the assignment operator (x = 1).
Here and in my other places in the notebook.
| " category_col, descp_col = df.columns.values.tolist()\n", | ||
| " shorttextdict = defaultdict(lambda : [])\n", | ||
| " for category, descp in zip(df[category_col], df[descp_col]):\n", | ||
| " if type(descp)==str:\n", |
There was a problem hiding this comment.
Will this work across Python 2 / Python 3?
There was a problem hiding this comment.
Yes. It is working fine for both Python 2 and 3.
| " shorttextdict = defaultdict(lambda : [])\n", | ||
| " for category, descp in zip(df[category_col], df[descp_col]):\n", | ||
| " if type(descp)==str:\n", | ||
| " shorttextdict[category] += [descp]\n", |
There was a problem hiding this comment.
append simpler, faster and more readable?
Also, if you need to convert to plain dict anyway below, a plain setdefault(key, []).append(x) may be easier than defaultdict.
| " \"\"\"\n", | ||
| " df = pd.read_csv(filepath)\n", | ||
| " category_col, descp_col = df.columns.values.tolist()\n", | ||
| " shorttextdict = defaultdict(lambda : [])\n", |
| " Return an example data set, with three subjects and corresponding keywords.\n", | ||
| " This is in the format of the training input.\n", | ||
| " \"\"\"\n", | ||
| " data_path = './datasets/keras_classifier_training_data.csv'\n", |
There was a problem hiding this comment.
os.path.join better (will work on Windows too).
Here and elsewhere.
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "The result above clearly suggests (~ 98% probability!) that the input `artificial intellegence` should belong to the category `mathematics`, which conforms very well with the expected output in this case.\n", |
|
@piskvorky Thanks a lot for your comprehensive feedback! :) I'd be happy to make these changes in a new PR. I'll also try to keep in mind these code-style issues in the future. |
This PR adds a Keras wrapper for Word2Vec Model in Gensim.