Add `gensim.models.BaseKeyedVectors.add_entity` method for fill `KeyedVectors` in manual way. Fix #1942 by persiyanov · Pull Request #1957 · piskvorky/gensim

persiyanov · 2018-03-06T14:19:31Z

Pull request related to this issue.

With these changes, it would be possible to add new word vectors into a KeyedVectors object.

@menshikh-iv please, take a look. I will write tests on it after addressing your comments.

Moreover, I have some questions/doubts:

1. Method name: is it okay as it is, or should it be named something like add_entity/add_word?
2. Weights parameter: should we check the type and shape of weights and raise an exception in a bad case?
3. self.vectors contiguity: here the vectors list is cast to a C-contiguous array. Does numpy preserve C-contiguity after operations such as np.vstack? If not, I think I should add a numpy.ascontiguousarray cast in my code.
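
One way to probe the contiguity question is to inspect the `flags` of the stacked result. The sketch below assumes C-ordered float32 inputs; the array names and shapes are illustrative and not taken from the PR, and it makes no claim about other memory layouts.

```python
import numpy as np

# Illustrative stand-ins for self.vectors and one new vector (assumed shapes).
vectors = np.zeros((3, 4), dtype=np.float32)
new_vector = np.ones((1, 4), dtype=np.float32)

stacked = np.vstack((vectors, new_vector))
print(stacked.shape)
print(stacked.flags['C_CONTIGUOUS'])  # inspect the layout of the result

# np.ascontiguousarray returns a C-contiguous array (copying only if needed),
# so it can serve as an explicit guard when the layout matters.
guarded = np.ascontiguousarray(stacked)
```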

menshikh-iv · 2018-03-07T05:15:17Z

Hello @persiyanov

1. I think both (add_word & add_entity) are OK for this case.
2. I don't think so (because you have no predefined length if you create the empty class). Anyway, this is needed only for some special cases.

menshikh-iv · 2018-03-07T05:16:58Z

        else:
            raise KeyError("'%s' not in vocabulary" % entity)

+    def add(self, entity, weights):


What about re-using the existing function (this currently duplicates https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/utils_any2vec.py#L182)?

Also, need to add tests to check this functionality:

load some kv, add more vectors manually, and check that they were added correctly

create an empty kv, fill it manually, and check that everything works

@menshikh-iv

About reusing this function:

It's a bit difficult because in this function vstack is used to append new word vector to self.vectors, while add_word in utils_any2vec creates vectors array at first (https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/utils_any2vec.py#L180) and then it just inserts vectors into it (https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/utils_any2vec.py#L197).

While it's possible to follow DRY here, the interface of BaseKeyedVectors.add() method will be more complicated (or I can change the logic in utils_any2vec -- not to create vectors = np.zeros(...) but append each word to the array, but it could decrease the performance of load_word2vec_format function).

If either of these two options is okay, I'll implement it.

Aha, thanks for the suggestion, let's leave it as is.
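
The trade-off discussed in this thread (allocating the whole array up front versus appending with vstack) can be sketched as follows. The function names are hypothetical and the code is a simplified illustration, not the code in utils_any2vec or in this PR.

```python
import numpy as np

def build_preallocated(rows, vector_size):
    """Allocate once, then fill row by row (row count known up front)."""
    vectors = np.zeros((len(rows), vector_size), dtype=np.float32)
    for i, row in enumerate(rows):
        vectors[i] = row
    return vectors

def build_by_appending(rows, vector_size):
    """Grow with vstack; every call copies everything stored so far."""
    vectors = np.zeros((0, vector_size), dtype=np.float32)
    for row in rows:
        new_row = np.asarray(row, dtype=np.float32).reshape(1, -1)
        vectors = np.vstack((vectors, new_row))
    return vectors

rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
a = build_preallocated(rows, 2)
b = build_by_appending(rows, 2)
```

Both strategies produce the same array; they differ in how many times the data is copied, which is why the appending variant could slow down a bulk loader.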

menshikh-iv · 2018-03-08T05:40:00Z


+    def test_add_word(self):
+        """Test that adding word in a manual way works correctly."""
+        from numpy.random import randn


Can you remove the import and use np.random.randn?

menshikh-iv · 2018-03-08T05:41:59Z

            raise KeyError("'%s' not in vocabulary" % entity)

+    def add_entity(self, entity, weights):
+        """Accept an entity specified by string tag and vector weights as 1D numpy array with shape (`vector_size`,).


Please use numpy-style docstrings

menshikh-iv · 2018-03-08T05:42:40Z

+        self.vocab[entity] = Vocab(index=entity_id, count=1)
+        self.index2entity.append(entity)
+
+    def add_word(self, entity, weights):


Maybe this isn't needed?

menshikh-iv · 2018-03-08T05:44:49Z

CC: @gojomo wdyt?

gojomo · 2018-03-08T20:39:57Z

This sort of functionality is a natural and useful addition; ideally it'd also be joined with other new APIs to assist initial building of a KeyedVectors from external data sources (rather than just the direct property-tampering of current usage).

There should be a bulk addition option: otherwise doing lots of single-adds in a loop requires many wasteful reallocations of 1-vector-larger-arrays.

Should perhaps use __setitem__() rather than a named method, or at least have that as an idiomatic option.

Longer-term, a KeyedVectors that actually splits its contents into multiple segments of smaller arrays for more efficient add/delete grow/shrinks would also make sense. But, that'd require more hiding of internal implementation details in a way that could break lots of the current direct-property-accesses by other code.

re: @persiyanov your questions - (1) I'd avoid 'word' in any method names, to stay loyal to intent that this class accepts keys other than just words. (2) if the vstack() exception which results from attempting a mismatched shape is already sufficiently descriptive, no need for extra checking here... but if it's unclear/confusing, then a compat-shape-check would make sense. (3) I don't know but that's worth checking.
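
To judge point (2), it helps to compare the error numpy raises on a mismatched shape with an explicit check. In the sketch below, `check_shape` is a hypothetical helper written for illustration; it is not part of gensim or of this PR.

```python
import numpy as np

def check_shape(vectors, new_vectors):
    """Hypothetical helper: raise a descriptive error on a vector-size mismatch."""
    expected = vectors.shape[1]
    if new_vectors.ndim != 2 or new_vectors.shape[1] != expected:
        raise ValueError(
            "expected vectors of shape (n, %d), got %s" % (expected, new_vectors.shape)
        )

vectors = np.zeros((2, 3))
bad = np.ones((1, 4))

try:
    np.vstack((vectors, bad))
    numpy_error = None
except ValueError as err:
    numpy_error = str(err)  # exact wording depends on the NumPy version

try:
    check_shape(vectors, bad)
    custom_error = None
except ValueError as err:
    custom_error = str(err)
```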


persiyanov · 2018-03-09T13:35:42Z

@menshikh-iv @gojomo please, take a look.

I've implemented add_entities method and its alias __setitem__. Also, I've added bool flag replace which specifies vectors replacement strategy for those entities which are already in vocabulary.
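
The behaviour described here (bulk addition, a `replace` flag, and `__setitem__` as an alias) can be sketched with a toy class. `ToyKeyedVectors` is a stand-in written for illustration, not gensim's class, and it assumes each call receives a non-empty list of unique keys.

```python
import numpy as np

class ToyKeyedVectors:
    """Toy stand-in, not gensim's class: maps string keys to rows of a 2D array."""

    def __init__(self, vector_size):
        self.vocab = {}           # key -> row index
        self.index2entity = []    # row index -> key
        self.vectors = np.zeros((0, vector_size), dtype=np.float32)

    def add_entities(self, entities, weights, replace=False):
        weights = np.asarray(weights, dtype=np.float32)
        in_vocab_mask = np.array([e in self.vocab for e in entities], dtype=bool)

        # Register unseen keys, then append their vectors with one vstack call.
        for i in np.nonzero(~in_vocab_mask)[0]:
            self.vocab[entities[i]] = len(self.index2entity)
            self.index2entity.append(entities[i])
        self.vectors = np.vstack((self.vectors, weights[~in_vocab_mask]))

        # Overwrite vectors of already-known keys only when asked to.
        if replace:
            rows = [self.vocab[entities[i]] for i in np.nonzero(in_vocab_mask)[0]]
            self.vectors[rows] = weights[in_vocab_mask]

    def __setitem__(self, entities, weights):
        self.add_entities(entities, weights, replace=True)

kv = ToyKeyedVectors(vector_size=2)
kv.add_entities(["a", "b"], [[1, 1], [2, 2]])
kv.add_entities(["a", "c"], [[9, 9], [3, 3]])   # "a" keeps its old vector
kept = kv.vectors[kv.vocab["a"]].tolist()
kv[["a"]] = [[7, 7]]                            # __setitem__ replaces it
replaced = kv.vectors[kv.vocab["a"]].tolist()
```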

persiyanov · 2018-03-10T22:36:28Z

Ping @gojomo @menshikh-iv

menshikh-iv

Looks good to me @persiyanov 👍
I have only a few nitpicks about docstrings, and I'm not sure about add_entity.

menshikh-iv · 2018-03-12T03:45:06Z

+        ----------
+        entities : list of str
+            Entities specified by string tags.
+        weights: list of np.array or np.array


{list of numpy.ndarray, numpy.ndarray}

menshikh-iv · 2018-03-12T03:46:06Z

        else:
            raise KeyError("'%s' not in vocabulary" % entity)

+    def add_entity(self, entity, weights, replace=False):


maybe remove this method (add_entities looks enough, wdyt @gojomo?)

menshikh-iv · 2018-03-12T03:48:33Z

+            List of 1D np.array vectors or 2D np.array of vectors.
+        replace: bool, optional
+            Boolean flag indicating whether to replace vectors for entities which are already in the vocabulary.
+            Default, False, means that old vectors for those entities are keeped.


No need to duplicate "default" value for trivial case in docstring, maybe better to write something like

Flag indicating whether to replace vectors for entities which are already in the vocabulary, if True - replace vectors, otherwise - keep old vectors.

menshikh-iv · 2018-03-12T03:52:38Z

+            self.vectors[in_vocab_idxs] = weights[in_vocab_mask]
+
+    def __setitem__(self, entities, weights):
+        """Idiomatic way to call `add_entities` with `replace=True`.


better to write full docstring

menshikh-iv · 2018-03-12T14:26:06Z

@gojomo please have a look (for me LGTM, except add_entity method, because I don't see any reason for this alias).

persiyanov · 2018-03-13T23:39:42Z

ping @gojomo

gojomo · 2018-03-14T18:55:35Z

+        in_vocab_idxs = []
+        out_vocab_entities = []
+
+        for idx, entity in zip(range(len(entities)), entities):


enumerate() more idiomatic.
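
A minimal illustration of the suggestion, using a made-up list of keys: both forms yield the same (index, item) pairs, and `enumerate` states the intent directly.

```python
entities = ["cat", "dog", "fish"]  # illustrative keys

pairs_zip = list(zip(range(len(entities)), entities))
pairs_enum = list(enumerate(entities))
```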

gojomo · 2018-03-14T19:01:13Z

+        if len(self.vectors) == 0:
+            self.vectors = weights[~in_vocab_mask]
+        else:
+            self.vectors = vstack((self.vectors, weights[~in_vocab_mask]))


Might this line work even in the case where len(self.vectors)==0, making the check/branch unnecessary?

I think it's not obvious how to do that, because when an empty KeyedVectors object is created, self.vectors is []. In that case we can't use vstack(([], weights[~in_vocab_mask])): it raises ValueError: all the input array dimensions except for the concatenation axis must match exactly.

Is it possible for an empty KeyedVectors to have a self.vectors that is already a proper-dimensioned (0, vector_size) empty ndarray? (Not sure myself, but would simplify things in later places like this.)
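
The difference between the two empty states can be checked directly. This is a sketch with an assumed `vector_size` of 3, not code from the PR: a plain empty list cannot be stacked with 2D data, while a `(0, vector_size)` array can.

```python
import numpy as np

vector_size = 3  # assumed for illustration
new_vectors = np.ones((2, vector_size))

# A plain empty list cannot be stacked with 2D data.
try:
    np.vstack(([], new_vectors))
    list_error = None
except ValueError as err:
    list_error = err

# An empty array that already has vector_size columns stacks without a branch.
empty = np.zeros((0, vector_size))
stacked = np.vstack((empty, new_vectors))
```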

gojomo · 2018-03-14T19:02:43Z

+
+        in_vocab_mask = np.zeros(len(entities), dtype=np.bool)
+        in_vocab_idxs = []
+        out_vocab_entities = []


This method might be simpler without separate in_vocab_idxs and out_vocab_entities – just driving those ops from the mask, using options like where() or nonzero().
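
A sketch of the mask-driven approach, using a toy vocabulary and made-up keys. It uses `dtype=bool` rather than `np.bool`, since the plain builtin works across NumPy versions.

```python
import numpy as np

vocab = {"cat": 0, "dog": 1}               # toy vocabulary: key -> row index
entities = ["dog", "bird", "cat", "eel"]   # illustrative keys

in_vocab_mask = np.array([e in vocab for e in entities], dtype=bool)

# Positions within `entities` of known and unknown keys, derived from the mask.
known_positions = np.nonzero(in_vocab_mask)[0]
unknown_positions = np.nonzero(~in_vocab_mask)[0]

in_vocab_idxs = [vocab[entities[i]] for i in known_positions]
out_vocab_entities = [entities[i] for i in unknown_positions]
```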

gojomo · 2018-03-14T19:15:16Z

Functionality seems good.

I think the class has a preexisting terminology issue with the use of the word 'entity' where often 'key' would be more consistent/more specific. Also, as the 'items' inside this are definitionally 'vectors', vector/vectors are generally better terms than weights. While I'm not sure this PR can/should fix all of that, I'd prefer add_entity() & add_entities() be replaced with a single add(keys, vectors, replace=True) (that could also, for convenience, tolerate a single key/vector).

persiyanov · 2018-03-16T12:15:57Z

@gojomo @menshikh-iv please, take a look.

I've removed add_entities & add_entity but kept add method which can be used for all cases. Also, I got rid of out_vocab_entities variable, doing several np.nonzero(...) calls.

I also think that operating with keys/vectors instead of entities/weights is better, but it's not related to this task and better to be done in another PR.

menshikh-iv

Looks good to me, great work @persiyanov 👍

menshikh-iv · 2018-03-16T13:49:34Z

+        replace: bool, optional
+            Flag indicating whether to replace vectors for entities which are already in the vocabulary,
+            if True - replace vectors, otherwise - keep old vectors.
+        """


nitpick: a multiline docstring should end with an empty line, i.e.

    """
    ...
    last text

    """

menshikh-iv · 2018-03-16T13:51:12Z

+        """
+        if isinstance(entities, string_types):
+            entities = [entities]
+            weights = weights.reshape(1, -1)


Probably this should be weights = np.array(weights).reshape(1, -1), in case weights is, for example, a list of floats.
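
The point is easy to verify with a small sketch (the values are illustrative): a plain list has no `reshape` method, while wrapping it in `np.array` first yields the expected single-row 2D array.

```python
import numpy as np

weights = [0.1, 0.2, 0.3]  # a plain list of floats

has_reshape = hasattr(weights, "reshape")   # lists do not provide reshape
row = np.array(weights).reshape(1, -1)      # one row, vector_size columns
```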

persiyanov · 2018-03-16T14:22:10Z

@menshikh-iv Fixed

gojomo · 2018-03-19T18:24:40Z


 from numpy import dot, zeros, float32 as REAL, empty, memmap as np_memmap, \
-    double, array, vstack, sqrt, newaxis, integer, \
+    double, array, zeros, vstack, sqrt, newaxis, integer, \


zeros imported twice

@gojomo yeah, i've fixed it

menshikh-iv · 2018-03-20T07:06:00Z

Congratz with first time contribution @persiyanov 👍 🥇

gojomo · 2018-03-20T21:28:27Z

@persiyanov Yes, and thanks for your patience through all the subtle refinements!

piskvorky · 2018-03-27T14:48:29Z

Is the word entity still in? I'm strongly -1 on that, we should not merge that.

"Entity" has an established meaning in NLP, and it comes with certain connotations. Introducing a new concept into gensim like this muddies the waters, especially since the concept is not really introduced at all in the PR AFAICS: what is an "entity" here? I only skimmed the docstrings and they only mention the word, never explain it. This is confusing and inconsistent with our other docs.

gojomo · 2018-03-27T21:34:19Z

The widespread use of 'entity' was introduced in #1777; this PR is just following the code's existing practice. The code desperately needs a refactoring for consistency/clarity, even more so since the #1777 attempt!

menshikh-iv · 2018-03-27T21:37:11Z

@piskvorky this modification is for the BaseKeyedVectors class (which doesn't say anything about words, only abstract "entities" that can be "words" in the subclasses).

Check out other code from this class, for example

https://github.com/RaRe-Technologies/gensim/blob/2e08f4d3b218c9675d4f842f724af40a4f4ec1ee/gensim/models/keyedvectors.py#L124-L144

BaseKeyedVectors is used as the base class for PoincareKeyedVectors (where we have no "words")
https://github.com/RaRe-Technologies/gensim/blob/2e08f4d3b218c9675d4f842f724af40a4f4ec1ee/gensim/models/poincare.py#L772

piskvorky · 2018-03-29T12:49:19Z

I agree with @gojomo, the code desperately needs a proper refactoring. The current post-#1777 situation seems untenable. @gojomo any chance you could take this up?

@menshikh-iv your snippet doesn't explain entity in any way, so not sure how that's helpful. I'm strongly -1 on introducing new fundamental concepts in this way.

menshikh-iv · 2018-03-29T13:12:14Z

@piskvorky I don't explain it, I just show that the current code mimics the already existing and doesn't introduce any new concepts.

piskvorky · 2018-03-29T13:54:04Z

Understood. It's an issue with #1777, not this PR.

persiyanov added 2 commits March 6, 2018 16:47

Introduce BaseKeyedVectors.add(...) method

99bcf44

make default count=1

06955c4

menshikh-iv changed the title ~~[Fixes #1942]: Introduce BaseKeyedVectors.add(...) method~~ Add BaseKeyedVectors.add for fill kv in manual way. Fix #1942 Mar 7, 2018

menshikh-iv suggested changes Mar 7, 2018

add test on add_word method

089d346

menshikh-iv suggested changes Mar 8, 2018

persiyanov added 3 commits March 8, 2018 13:58

Merge branch 'develop' into feature/add-word-method-to-keyed-vectors

f428571

address @menshikh-iv comments

0aff584

fix test_keyedvectors after removing add_word alias

f6e5e79

menshikh-iv changed the title ~~Add BaseKeyedVectors.add for fill kv in manual way. Fix #1942~~ Add gensim.models.BaseKeyedVectors.add_entity method for fill KeyedVectors in manual way. Fix #1942 Mar 9, 2018

add __setitem__, add bulk entities processing + some tests on new fun…

d4b0ffe

…ctionality

menshikh-iv suggested changes Mar 12, 2018

addressing @menshikh-iv comments on docstrings

912d462

Merge branch 'develop' into feature/add-word-method-to-keyed-vectors

3611320

gojomo reviewed Mar 14, 2018

addressing @gojomo comments

437a142

menshikh-iv reviewed Mar 16, 2018

adrressing nitpicks

737cd36

persiyanov added 2 commits March 19, 2018 20:46

make self.vectors = np.zeros((0, vector_size)) by default

070fbed

fix pep8

2294c07

gojomo reviewed Mar 19, 2018

menshikh-iv merged commit 58d560b into piskvorky:develop Mar 20, 2018

persiyanov deleted the feature/add-word-method-to-keyed-vectors branch March 21, 2018 11:12

gojomo mentioned this pull request Sep 21, 2020

[MRG] *2Vec SaveLoad improvements #2939

Merged
