Reduce memory consumption of summarizer by horpto · Pull Request #2298 · piskvorky/gensim

horpto · 2018-12-16T08:42:06Z

At the request of @menshikh-iv:

While looking for information related to my previous PR on removing graph nodes, I found a couple of questions (this and this) about the memory consumption of summarization.
I took the first part of the English translation of War and Peace. It consists of 14426 unique docs (the number of graph nodes). I tested on a computer with 16 GiB of memory, an Intel i5, 64-bit Windows 10 and Python 3.7. The original version consumed 14 GiB of memory after 7 minutes of execution and failed with a MemoryError. My version consumed 2 GiB of memory and took ~2 min (108 sec), but it also failed with a MemoryError on the full War and Peace (~57000 unique docs), this time in the pagerank phase.
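A rough back-of-the-envelope check on why the pagerank phase can still run out of memory on the full text. This is only a sketch: it assumes (this was not measured in the PR) that the phase allocates at least one dense n × n matrix of 8-byte floats, and the helper name `dense_matrix_gib` is made up for illustration.

```python
def dense_matrix_gib(num_nodes: int, bytes_per_entry: int = 8) -> float:
    """Size in GiB of one dense num_nodes x num_nodes matrix.

    Assumption: 8 bytes per entry (float64); hypothetical helper, not gensim API.
    """
    return num_nodes * num_nodes * bytes_per_entry / 2**30


# Node counts are the ones quoted in the PR description.
for label, n in [("first part", 14426), ("full text (approx.)", 57000)]:
    print(f"{label}: {n} nodes -> {dense_matrix_gib(n):.1f} GiB per dense matrix")
# first part: 14426 nodes -> 1.6 GiB per dense matrix
# full text (approx.): 57000 nodes -> 24.2 GiB per dense matrix
```

Under that assumption a single dense matrix for ~57000 nodes would already exceed the 16 GiB of the test machine, while the 14426-node case stays close to the ~2 GiB reported above.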


horpto · 2018-12-19T23:36:36Z

I need #2263 to be merged before I can add some tests for the graph.

menshikh-iv · 2019-01-10T07:24:21Z

#2263 merged, feel free to continue


menshikh-iv

Thanks for the PR @horpto, can you also please update your first comment in the PR with:

- the reason for these changes (links to the complaints about RAM)
- a small benchmark (time & RAM), 3.6.0 vs summarization-refactoring, to better understand how much this improved things
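One possible way to collect the time and RAM numbers for such a comparison is sketched below. The `measure` helper and the `build_pairs` stand-in workload are hypothetical names, not part of gensim; in a real benchmark the workload would be the summarizer call on each branch. Note that `tracemalloc` only tracks allocations made through Python's memory allocators, so its peak can differ from process-level figures such as the ones quoted in this PR.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds, peak_traced_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        # Always stop tracing, even if func raises (e.g. MemoryError).
        tracemalloc.stop()
    return result, elapsed, peak


def build_pairs(n):
    """Stand-in workload: materialize all n * n index pairs."""
    return [(i, j) for i in range(n) for j in range(n)]


result, seconds, peak = measure(build_pairs, 200)
print(f"{len(result)} pairs in {seconds:.3f} s, peak {peak / 2**20:.1f} MiB traced")
```

Running the same helper against both versions on the same input would give directly comparable time and peak-memory figures.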

menshikh-iv · 2019-01-14T05:36:44Z

+        scores = []
+        for index in range(self.corpus_size):
+            score = self.get_score(document, index)
+            if score > 0:


In that case len(scores) <= self.corpus_size, why?

horpto · 2019-01-15

Because it's actually quite a sparse array, isn't it?
In summarizer._set_graph_edge_weights documents with such small weights will be dropped anyway, so there is no reason to waste extra memory. What's more, if we need a dense array, we can expand this bow back into one.

menshikh-iv · 2019-01-15

I don't get it: how do we know the ids of the documents that have 0 scores in that case?

horpto · 2019-01-15

Easy. It's like words with 0 weight in a bow. We have the ids of the docs with non-zero weight; they are saved in the bag-of-docs (I should rename the function-name part from bow, bag-of-weights, to bod, bag-of-docs). If a doc id isn't in the bag-of-docs, the weight of that doc is 0.
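The idea discussed in this thread can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `to_bag_of_docs` and `to_dense` are hypothetical names, and the scores are made-up values.

```python
def to_bag_of_docs(dense_scores):
    """Keep only (doc_id, score) pairs whose score is positive."""
    return [(doc_id, score) for doc_id, score in enumerate(dense_scores) if score > 0]


def to_dense(bag, corpus_size):
    """Expand a bag-of-docs back into a dense list; missing ids get 0.0."""
    dense = [0.0] * corpus_size
    for doc_id, score in bag:
        dense[doc_id] = score
    return dense


# Made-up scores of one document against a corpus of five documents.
scores = [0.0, 1.5, 0.0, 0.0, 0.7]
bag = to_bag_of_docs(scores)        # [(1, 1.5), (4, 0.7)]

# A doc id that is absent from the bag simply has weight 0.
weight_of_doc_2 = dict(bag).get(2, 0.0)

# The dense form can be recovered whenever it is needed.
restored = to_dense(bag, len(scores))
```

Because each entry carries its own doc id, no information is lost by skipping the zeros, and the memory used grows with the number of non-zero scores rather than with the corpus size.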

menshikh-iv · 2019-01-18T05:09:39Z

> I took the first part of the English translation of War and Peace. It consists of 14426 unique docs (the number of graph nodes). I tested on a computer with 16 GiB of memory, an Intel i5, 64-bit Windows 10 and Python 3.7. The original version consumed 14 GiB of memory after 7 minutes of execution and failed with a MemoryError. My version consumed 2 GiB of memory and took ~2 min (108 sec), but it also failed with a MemoryError on the full War and Peace (~57000 unique docs), this time in the pagerank phase.

Awesome result, nice improvement for new release, thanks @horpto 👍

reduce memory consumption

a0523f6

horpto changed the title ~~[WIP] reduce memory consumption~~ [WIP] reduce memory consumption of summarizer Dec 16, 2018

horpto added 3 commits December 16, 2018 21:08

fix build

7273b05

Merge remote-tracking branch 'upstream/develop' into summarization-re…

0d606fa

…factoring

add deleting null-weighted edge

a7d21c2

horpto added 8 commits January 12, 2019 16:44

Merge remote-tracking branch 'upstream/develop' into summarization-re…

3b2c818

…factoring

iterate over bm25 weights

ae39b0b

fix build

c04ed73

fix wrong example

9f0259a

add method iter_graph in Graph

bb9365c

add index argument in SyntacticUnit

b56a811

fix logging messages

70bd79a

refactor graph - remove unnecessary parts

008d5cb

horpto changed the title ~~[WIP] reduce memory consumption of summarizer~~ reduce memory consumption of summarizer Jan 14, 2019

horpto changed the title ~~reduce memory consumption of summarizer~~ Reduce memory consumption of summarizer Jan 14, 2019

horpto commented Jan 14, 2019


Comment thread gensim/summarization/pagerank_weighted.py

menshikh-iv reviewed Jan 14, 2019


horpto added 3 commits January 17, 2019 23:12

Add test and fix typos

4549a86

fix built (flake8)

f6b68bf

add printing of documents number

b850ff4

menshikh-iv merged commit fabeffe into piskvorky:develop Jan 18, 2019

fbarrios mentioned this pull request Jan 18, 2019

Adapt Gensim improvements summanlp/textrank#59

Open

gojomo mentioned this pull request Jul 6, 2023

Discussion: discard "gensim.summarization"? #2592

Closed

menshikh-iv merged 15 commits into piskvorky:develop from horpto:summarization-refactoring
