Add gensim.models.BaseKeyedVectors.add_entity method for fill KeyedVectors in manual way. Fix #1942#1957
Conversation
|
Hello @persiyanov
|
BaseKeyedVectors.add for fill kv in manual way. Fix #1942
| else: | ||
| raise KeyError("'%s' not in vocabulary" % entity) | ||
|
|
||
| def add(self, entity, weights): |
There was a problem hiding this comment.
What's about re-using this function in (this is duplication right now from https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/utils_any2vec.py#L182)?
There was a problem hiding this comment.
Also, need to add tests to check this functionality:
- load some kv, add more vectors in a manual way and check that this added fine
- create empty kv, fill it manually and check that all fine
There was a problem hiding this comment.
About reusing this function:
It's a bit difficult because in this function vstack is used to append new word vector to self.vectors, while add_word in utils_any2vec creates vectors array at first (https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/utils_any2vec.py#L180) and then it just inserts vectors into it (https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/utils_any2vec.py#L197).
While it's possible to follow DRY here, the interface of BaseKeyedVectors.add() method will be more complicated (or I can change the logic in utils_any2vec -- not to create vectors = np.zeros(...) but append each word to the array, but it could decrease the performance of load_word2vec_format function).
If some of these two options is okay, I'll implement it.
There was a problem hiding this comment.
Aha, thanks for suggestion, let's stay it as is.
|
|
||
| def test_add_word(self): | ||
| """Test that adding word in a manual way works correctly.""" | ||
| from numpy.random import randn |
There was a problem hiding this comment.
can you remove import and use np.random.randn
| raise KeyError("'%s' not in vocabulary" % entity) | ||
|
|
||
| def add_entity(self, entity, weights): | ||
| """Accept an entity specified by string tag and vector weights as 1D numpy array with shape (`vector_size`,). |
| self.vocab[entity] = Vocab(index=entity_id, count=1) | ||
| self.index2entity.append(entity) | ||
|
|
||
| def add_word(self, entity, weights): |
There was a problem hiding this comment.
Maybe this doesn't need?
|
CC: @gojomo wdyt? |
|
This sort of functionality is a natural and useful addition; ideally it'd also be joined with other new APIs to assist initial building of a There should be a bulk addition option: otherwise doing lots of single-adds in a loop requires many wasteful reallocations of 1-vector-larger-arrays. Should perhaps use Longer-term, a re: @persiyanov your questions - (1) I'd avoid 'word' in any method names, to stay loyal to intent that this class accepts keys other than just words. (2) if the |
BaseKeyedVectors.add for fill kv in manual way. Fix #1942gensim.models.BaseKeyedVectors.add_entity method for fill KeyedVectors in manual way. Fix #1942
|
@menshikh-iv @gojomo please, take a look. I've implemented |
|
Ping @gojomo @menshikh-iv |
menshikh-iv
left a comment
There was a problem hiding this comment.
Looks good to me @persiyanov: +1:
I have only several nitpicks about docstrings + not sure about add_entity.
| ---------- | ||
| entities : list of str | ||
| Entities specified by string tags. | ||
| weights: list of np.array or np.array |
There was a problem hiding this comment.
{list of numpy.ndarray, numpy.ndarray}
| else: | ||
| raise KeyError("'%s' not in vocabulary" % entity) | ||
|
|
||
| def add_entity(self, entity, weights, replace=False): |
There was a problem hiding this comment.
maybe remove this method (add_entities looks enough, wdyt @gojomo?)
| List of 1D np.array vectors or 2D np.array of vectors. | ||
| replace: bool, optional | ||
| Boolean flag indicating whether to replace vectors for entities which are already in the vocabulary. | ||
| Default, False, means that old vectors for those entities are keeped. |
There was a problem hiding this comment.
No need to duplicate "default" value for trivial case in docstring, maybe better to write something like
Flag indicating whether to replace vectors for entities which are already in the vocabulary, if True - replace vectors, otherwise - keep old vectors.
| self.vectors[in_vocab_idxs] = weights[in_vocab_mask] | ||
|
|
||
| def __setitem__(self, entities, weights): | ||
| """Idiomatic way to call `add_entities` with `replace=True`. |
There was a problem hiding this comment.
better to write full docstring
|
@gojomo please have a look (for me LGTM, except |
|
ping @gojomo |
| in_vocab_idxs = [] | ||
| out_vocab_entities = [] | ||
|
|
||
| for idx, entity in zip(range(len(entities)), entities): |
There was a problem hiding this comment.
enumerate() more idiomatic.
| if len(self.vectors) == 0: | ||
| self.vectors = weights[~in_vocab_mask] | ||
| else: | ||
| self.vectors = vstack((self.vectors, weights[~in_vocab_mask])) |
There was a problem hiding this comment.
Might this line work even in the case where len(self.vectors)==0, making the check/branch unnecessary?
There was a problem hiding this comment.
I think it's not obvious how to do that, because when empty KeyedVectors object is created, self.vectors = [] is true. In that case, we can't use vstack(([], weights[~in_vocab_mask])) and ValueError: all the input array dimensions except for the concatenation axis must match exactly is raised.
There was a problem hiding this comment.
Is it possible for an empty KeyedVectors to have a self.vectors that is already a proper-dimensioned (0, vector_size) empty ndarray? (Not sure myself, but would simplify things in later places like this.)
|
|
||
| in_vocab_mask = np.zeros(len(entities), dtype=np.bool) | ||
| in_vocab_idxs = [] | ||
| out_vocab_entities = [] |
There was a problem hiding this comment.
This method might be simpler without separate in_vocab_idxs and out_vocab_entities – just driving those ops from the mask, using options like where() or nonzero().
|
Functionality seems good. I think class has a preexisting terminology issue with the use of word 'entity' where often 'key' would be more-consistent/more-specific. Also, as the 'items' inside this are definitionally 'vectors', generally |
|
@gojomo @menshikh-iv please, take a look. I've removed I also think that operating with |
menshikh-iv
left a comment
There was a problem hiding this comment.
Looks good to me, great work @persiyanov 👍
| replace: bool, optional | ||
| Flag indicating whether to replace vectors for entities which are already in the vocabulary, | ||
| if True - replace vectors, otherwise - keep old vectors. | ||
| """ |
There was a problem hiding this comment.
nitpick: multiline docstring should ends with empty line, i.e.
"""
...
last text
"""
| """ | ||
| if isinstance(entities, string_types): | ||
| entities = [entities] | ||
| weights = weights.reshape(1, -1) |
There was a problem hiding this comment.
probably, should be weights = np.array(weights).reshape(1, -1) for case if weights, for example, list of floats
|
@menshikh-iv Fixed |
|
|
||
| from numpy import dot, zeros, float32 as REAL, empty, memmap as np_memmap, \ | ||
| double, array, vstack, sqrt, newaxis, integer, \ | ||
| double, array, zeros, vstack, sqrt, newaxis, integer, \ |
|
Congratz with first time contribution @persiyanov 👍 🥇 |
|
@persiyanov Yes, and thanks for your patience through all the subtle refinements! |
|
Is the word "Entity" has an established meaning in NLP, it comes with certain connotations. Introducing a new concept into gensim like this muddles the waters. Especially since the concept is not really introduced at all in the PR AFAICS—what is an "entity" here? I only skimmed the docstrings and they only mention the word, never explain it. This is confusing and inconsistent with our other docs. |
|
@piskvorky this modification for Check out other code from this class, for example
|
|
I agree with @gojomo , the code desperately needs a proper refactoring. The current post-#1777 situation seems untenable. @gojomo any chance you could take this up? @menshikh-iv your snippet doesn't explain |
|
@piskvorky I don't explain it, I just show that the current code mimics the already existing and doesn't introduce any new concepts. |
|
Understood. It's an issue with #1777 , not this PR. |
Pull request related to this issue.
With these changes, it would be possible to add new word vectors into a
KeyedVectorsobject.@menshikh-iv please, take a look. I will write tests on it after addressing your comments.
Moreover, I have some questions/doubts:
add_entity/add_word?weightsand raise an exception in a bad case?self.vectorscontiguity: herevectorslist is casted to C-contiguous array. Does numpy preserve C-contiguity after operations such asnp.vstack? If so, I think I should addnumpy.ascontiguousarraycast in my code.