Fix method estimate_memory from gensim.models.FastText & huge performance improvement. Fix #1824#1916
Conversation
|
@jbaiter Wow! Please resolve merge-conflicts (this is critical right now). Probably, create new branch (based on fresh When you resolve conflicts - please ping me for review / any help. |
7c6afb2 to
08c464a
Compare
08c464a to
51a1a6e
Compare
|
@menshikh-iv So I managed to rebase my changes on the latest develop branch. |
| out_expected_vec = numpy.array([-1.34948218, -0.8686831, -1.51483142, -1.0164026, 0.56272298, | ||
| 0.66228276, 1.06477463, 1.1355902, -0.80972326, -0.39845538]) | ||
| out_expected_vec = numpy.array([-0.33959097, -0.21121596, -0.37212455, -0.25057459, 0.11222091, | ||
| 0.17517674, 0.26949012, 0.29352987, -0.1930912, -0.09438948]) |
There was a problem hiding this comment.
This is probably not correct, the vector seems to differ between different Python versions (the above was with 3.6)
…tPersistenceForOldVersions
|
I now reverted all changes to the deprecated On Windows there seems to be a memory-related issue, the allocation for the ngram vectors in |
|
@jbaiter thanks, for memory - this is known issue for appveyour :( I'll look into later, thanks for your patience. |
menshikh-iv
left a comment
There was a problem hiding this comment.
CC: @manneshiva can you review this one, please?
| indexes[effective_words] = word.index | ||
|
|
||
| subwords = [model.wv.ngrams[subword_i] for subword_i in model.wv.ngrams_word[model.wv.index2word[word.index]]] | ||
| subwords = [model.wv.hash2index[ft_hash(subword) % model.bucket] |
There was a problem hiding this comment.
Please use only hanging indents (no vertical).
There was a problem hiding this comment.
That is, remove the newline and keep the comprehension on a single line?
There was a problem hiding this comment.
single line (if line <= 120 characters) or something like
subwords = [
....
....
]| @@ -0,0 +1,21 @@ | |||
| #!/usr/bin/env cython | |||
There was a problem hiding this comment.
This related only with n-grams model, I think it's better to move it to fasttext_inner.pyx
There was a problem hiding this comment.
I don't think this would work, since both functions are used by both models.fasttext and models.keyedvectors. Moving the functions into models.fasttext would break, since this would cause a circular import.
There was a problem hiding this comment.
Aha, so, in this case, better to name it as _utils_any2vec.pyx (similar with _mmreader.pyx and _matutils.pyx)
| # cython: cdivision=True | ||
| # coding: utf-8 | ||
|
|
||
| def ft_hash(unicode string): |
There was a problem hiding this comment.
why not cdef with nogil?
There was a problem hiding this comment.
Both functions need to be called from both Python and Cython, so cdef won't work, but cpdef should. Will be fixed.
There was a problem hiding this comment.
nogil doesn't work, since the function returns a Python object.
| return h | ||
|
|
||
|
|
||
| cpdef compute_ngrams(word, unsigned int min_n, unsigned int max_n): |
There was a problem hiding this comment.
why not nogil, you can fix type for word.
There was a problem hiding this comment.
Fixing the type for word doesn't really work, since the token might be a str on Python 2.7. nogil won't work either, since the function returns a Python object.
|
|
||
|
|
||
| cpdef compute_ngrams(word, unsigned int min_n, unsigned int max_n): | ||
| cdef unicode extended_word = f'<{word}>' |
There was a problem hiding this comment.
f-strings supports only in 3.6 (and we maintain 2.7, 3.5, 3.6), please use simple concatenation (or any alternative) here.
There was a problem hiding this comment.
This is in Cython, which, as far as I understood, since 0.24 automatically generates cross-compatible C-code for it. It works fine under 2.7.
There was a problem hiding this comment.
wow, really? I didn't know about it, thanks for the information!
There was a problem hiding this comment.
You can read about it here: http://cython.readthedocs.io/en/latest/src/tutorial/strings.html#string-literals
| model_neg.build_vocab(new_sentences, update=True) # update vocab | ||
| model_neg.train(new_sentences, total_examples=model_neg.corpus_count, epochs=model_neg.iter) | ||
| self.assertEqual(len(model_neg.wv.vocab), 14) | ||
| self.assertTrue(len(model_neg.wv.ngrams), 271) |
There was a problem hiding this comment.
Why you removed part of tests (I mean all removed lines in tests)?
There was a problem hiding this comment.
As I mentioned in the PR, one optimization was to remove the storage of ngrams on the model and solely rely on the hashes. This is why any tests that assert the number of ngrams in the model are no longer neccesary.
With the lack of ngrams, the __contains__ check now also only uses the hashed and bucketed ngrams, which is why a 'real' OOV is a lot rarer (i.e. it will only happen if not all buckets are occupied and none of the ngrams in the token match any occupied bucket).
|
@jbaiter don't forget to resolve merge-conflict too |
manneshiva
left a comment
There was a problem hiding this comment.
@jbaiter I went through your PR and it looks good to me. Great job!
Just one more small deletion required in /gensim/models/deprecated/fasttext.py. Please delete,
new_model.wv.ngrams_word = old_model.wv.ngrams_word
new_model.wv.ngrams = old_model.wv.ngrams
and the corresponding asserts in test_fasttext.py, here.
I also feel we should also evaluate the effect of the changes in this PR on the quality of vectors learnt. Maybe compare the old and new code by training a FastText model on text8 and looking at the accuracies (using accuracy) of learnt vectors on question-answers.txt. It would also be interesting to see the memory consumption in both cases.
cc: @menshikh-iv
This removes the expensive calls to `compute_ngrams` and `ft_hash` during training and uses a simple lookup in an int -> int[] mapping instead, resulting in a dramatic increase in training performance.
|
I ran some benchmarks with my optimized version and the current gensim implementation of FastText. Initially the performance was about 10x slower, but I implemented an optimization that pre-generates the ngram buckets for each word to avoid calling I trained on the
As you can see, the goal of reducing memory consumption was achieved and we additionally almost doubled the training speed, while maintaining evaluation speed. The quality of the vectors seems to suffer a bit, however I think that this might be due to the different random initializations of the two models.
|
|
@jbaiter first table looks awesome: almost x2 faster + reduce memory usage, fantastic 🔥! About second table & random init: can you train/evaluate several times & average results (to exclude random effect) please? Also, please have a look at Appveyor, I see |
|
@jbaiter The speedup looks great! Thanks for this contribution.
|
|
@manneshiva @jbaiter maybe try to compare on bigger corpus (something ~1GB, not |
Will do, I'll also do a run each with a fixed seed.
Do you have any idea what could be causing this?
Can you recommend a suitable dataset that works with the tests in
Will do!
Yes, the values for subwords = [model.wv.ngrams[subword_i] for subword_i in model.wv.ngrams_word[model.wv.index2word[word.index]]]vs subwords = model.wv.buckets_word[word.index]The current version does |
This happens sometimes with appveyour (by memory limit reasons), so, you can try to use the smaller model in this case.
Sample from https://github.com/RaRe-Technologies/gensim-data/releases/tag/wiki-english-20171001 should be a good idea (you should pick first 1M articles). |
|
I'll make benchmark myself too ( Code: https://gist.github.com/menshikh-iv/ba8cba26744c668e73b59d5972dabbf8 from gensim.parsing.preprocessing import (
preprocess_string, strip_punctuation, strip_multiple_whitespaces, remove_stopwords, strip_short
)
prc = partial(
preprocess_string,
filters=[strip_punctuation, strip_multiple_whitespaces, remove_stopwords, strip_short]
)Model parameters: almost default, only
|
|
So I ran the As for the causes of the speedup, it seems that my changes somehow result in a significant increase in parallelism, as can be gathered from these performance counters: Perfomance counters for training + evaluation on current Performance counters for training + evaluation on optimized Additionally to using a full core more, IPC seems to be higher (1.32 instructions/cycle vs 1.12 before) and the overall number of instructions is lower (could be because of the lower number of lookups?) |
|
I checked current PR with 500,000 articles from wiki, my results #1916 (comment), this looks exciting 🌟 🔥 impressive work @jbaiter! I also checked, the model from |
|
@jbaiter last things that missed here
|
|
This is files from profiler (current relese version + PR version) Script: import gensim.downloader as api
from gensim.models import FastText
data = api.load("text8")
model = FastText(data)@manneshiva this is for you |
estimate_memory from gensim.models.FastText & huge performance improvement. Fix #1824
| wv.hash2index[ngram_hash] = new_hash_count | ||
| wv.ngrams[ngram] = wv.hash2index[ngram_hash] | ||
| new_hash_count = new_hash_count + 1 | ||
| wv.num_ngram_vectors = 0 |
There was a problem hiding this comment.
We could probably reduce some variables here - there seems to be some redundancy, if I understand correctly. wv.num_ngram_vectors, new_hash_count and len(ngram_indices) serve effectively the same purpose.
Maybe we could use len(ngram_indices) within the loop and set wv.num_ngram_vectors at the end of the loop?
| new_ngrams = list(set(new_ngrams)) | ||
| wv.num_ngram_vectors += len(new_ngrams) | ||
| logger.info("Number of new ngrams is %d", len(new_ngrams)) | ||
| if not wv.buckets_word: |
There was a problem hiding this comment.
What is the purpose of this?
| new_hash_count = new_hash_count + 1 | ||
| else: | ||
| wv.ngrams[ngram] = wv.hash2index[ngram_hash] | ||
| num_new_ngrams = 0 |
There was a problem hiding this comment.
There seems to be some redundancy again with new_hash_count, num_new_ngrams.
| continue | ||
| ngram_indices.append(len(wv.vocab) + ngram_hash) | ||
| wv.hash2index[ngram_hash] = wv.num_ngram_vectors | ||
| wv.num_ngram_vectors += 1 |
There was a problem hiding this comment.
Can be set to len(ngram_indices) at the end instead (sorry for nitpicking, but we already have very long code for some of these methods)
| if ngram_hash in self.hash2index: | ||
| word_vec += ngram_weights[self.hash2index[ngram_hash]] | ||
| if word_vec.any(): | ||
| return word_vec / len(ngrams) |
There was a problem hiding this comment.
This probably needs to be updated to only take into account the ngrams for which hashes were present in self.hash2index
| 0.58818, | ||
| 0.57828, | ||
| 0.75801 | ||
| -0.21929, |
There was a problem hiding this comment.
What is the reason for this change? Could it because of the len(ngrams) issue mentioned in a comment above?
There was a problem hiding this comment.
That's really a crucial question, but it's not because of the len(ngrams) issue.
The change was honestly simply made because the numbers were pretty similar and I thought that the vectors just changed a bit since the new code is far more lenient in assigning a vector to unknown ngrams (i.e. once all buckets are occupied, any ngram will result in a vector, even if it was not in the original corpus).
But it looks like there might a bug in the old code that has something to do with this: There are a lot more ngram vectors in the loaded model (17004) than there are in the model on disk (2762). This is probably because wv.vectors_ngram = wv.vectors_ngrams.take(ngram_indices, axis=0) in init_ngrams_post_load will result in a (num_ngrams_total, ngram_vec_len) matrix. Shouldn't vectors_ngram have a shape of (num_buckets, ngram_vec_len)? At least that's the case in the new code, and it follows from my (not necessarily correct) understanding of how the bucketing in this implementation works.
This sound similar to what was reported in #1779
There was a problem hiding this comment.
...since the new code is far more lenient in assigning a vector to unknown ngrams (i.e. once all buckets are occupied, any ngram will result in a vector, even if it was not in the original corpus).
Ahh right, that makes sense, thanks for explaining.
Re: the number of ngram vectors being greater than num_buckets (or the number of vectors on disk) - I see why that might have been happening. With a ngram vocab larger than the number of buckets, a lot of ngrams will be mapped to the same indices. And when .take is passed a list that contains multiple occurrences of the same index, the vector at that index is "taken" multiple times.
For example -
all_vectors = np.array([[0.1, 0.3], [0.3, 0.1]])
taken_vectors = all_vectors.take([0, 1, 0], axis=0)
taken_vectors.shape
>>> (3, 2)So it wouldn't result in incorrect results, but yeah, it'd result in unexpectedly high memory usage (and kind of blowing out the whole idea of keeping memory usage constant even with increasing ngram vocabs, out of the water). Thanks for investigating this and explaining it!
@menshikh-iv we're good to merge from my side
| 0.18025, | ||
| -0.14128, | ||
| 0.22508 | ||
| -0.49111, |
There was a problem hiding this comment.
Ditto:
What is the reason for this change? Could it because of the len(ngrams) issue mentioned in a comment above?
|
Some missed stuff
|
This PR is an attempt to optimize the memory usage of the FastText model and to provide a more accurate
FastText.estimate_memorymethod.Specifically, it implements the following improvements:
ft_hashfunctioncompute_ngramsfunctionIn its current state this PR does not merge and does not pass the test suite, due to these issues:
The improvements were done before the refactoring of the word embedding models, likely some code will have to be moved aroundThere are some Python2/3 issues with the cythonizedcompute_ngramsfunctionngramsattribute of the model, but in the optimized model I use the ngram hash. Since these hashes are bucketed, an OOV is most often no longer possible (i.e. if all buckets are occupied), i.e. these tests would have to be removed. Is this okay?I think I can do 1) and 2) on my own when I find the time, but for 3) I'd need some help, since I'm not that familiar with the intentions behind the old code.