Fix formula in `gensim.summarization.bm25`. Fix #1828 by sj29-innovate · Pull Request #1833 · piskvorky/gensim

sj29-innovate · 2018-01-08T11:10:07Z

No description provided.

menshikh-iv · 2018-01-08T12:40:21Z

Hey @sj29-innovate, what was it #1832 #1831 #1830?
Probably better pre-calculate length of documents in initialize method, wdyt?
Also, please add tests for bm25.

sj29-innovate · 2018-01-08T13:25:28Z

I think you are correct , i should probably pre calculate all document lengths.

sj29-innovate · 2018-01-08T13:50:12Z

Please ignore #1832 #1831 #1830 , I was trying to figure out my way through and due to build errors i closed those PR's instead of committing new changes there itself.
I am committing changes to #1833 , have precomputed document lengths in " doc_len" list inside the initialization function so that they can be used elsewhere easily.

menshikh-iv

Thanks @sj29-innovate, please make last fixes and I'll merge your PR!

menshikh-iv · 2018-01-09T07:21:46Z

        """Calculates frequencies of terms in documents and in corpus. Also computes inverse document frequencies."""
        for document in self.corpus:
            frequencies = {}
+            (self.doc_len).append(len(document))


Useless ( ), please remove.

menshikh-iv · 2018-01-09T07:22:06Z

            idf = self.idf[word] if self.idf[word] >= 0 else EPSILON * average_idf
            score += (idf * self.f[index][word] * (PARAM_K1 + 1)
-                      / (self.f[index][word] + PARAM_K1 * (1 - PARAM_B + PARAM_B * len(document) / self.avgdl)))
+                      / (self.f[index][word] + PARAM_K1 * (1 - PARAM_B + PARAM_B * self.doc_len[index] / self.avgdl)))


please add several simple tests for BM25

sj29-innovate · 2018-01-09T08:17:01Z

@menshikh-iv Can you please guide me a bit towards writing unit tests and also what corpus should i be using for the tests. Thanks a lot.

menshikh-iv · 2018-01-09T09:42:43Z

@sj29-innovate
corpus - any small, for example, you can use common_texts from from gensim.test.utils import common_texts
test - something simple, apply BM25 to corpus and query document (some kind of "sanity-check")

sj29-innovate · 2018-01-09T09:58:54Z

Should I be testing things like document should show maximum matching with itself and other such conditions ? Thanks for your help.

menshikh-iv · 2018-01-09T10:43:55Z

@sj29-innovate sounds good.

sj29-innovate · 2018-01-09T17:05:09Z

@menshikh-iv ci/circleci tests shows command timed out . At some point in the tests it is asking for username , i think it is getting stuck there.Can you please help me with this.

https://circleci.com/gh/RaRe-Technologies/gensim/25?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link#

menshikh-iv · 2018-01-11T05:52:42Z

@sj29-innovate please merge fresh "develop" to your PR (CircleCI don't see needed config in your branch -> build from "inner-default-python" config), this should fix your problem!

…into develop

sj29-innovate · 2018-01-11T10:27:19Z

@menshikh-iv I managed to resolve circleci failing tests . Requesting your feedback for the same. Thanks!

menshikh-iv

Good work @sj29-innovate, please make last changes and I'll merge your PR!

menshikh-iv · 2018-01-11T14:32:08Z

@@ -4,7 +4,7 @@
 # Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html


Please remove `Keras-2.1.2-py2.7.egg from PR (comment here, because I can't add comment to binary file)

menshikh-iv · 2018-01-11T14:35:18Z

+        for index, doc_weights in enumerate(weights):
+            expected = max(doc_weights)
+            predicted = doc_weights[index]
+            self.assertEqual(expected, predicted)


Should be assertAlmostEqual (because scores are float)

menshikh-iv · 2018-01-11T14:35:34Z

+        weights = get_bm25_weights(common_texts)
+        for doc_weights in weights:
+            for weight in doc_weights:
+                self.assertTrue(weight >= 0)


0 -> 0.

menshikh-iv · 2018-01-11T14:35:44Z

+        """ A document should always get the same weight when matched with a particular document """
+        corpus = [['cat', 'dog', 'mouse'], ['cat', 'lion'], ['cat', 'lion']]
+        weights = get_bm25_weights(corpus)
+        self.assertEqual(weights[0][1], weights[0][2])


Should be assertAlmostEqual (because scores are float)

menshikh-iv · 2018-01-11T14:36:22Z

+        """ Two disjoint documents should have zero matching"""
+        corpus = [['cat', 'dog', 'lion'], ['goat', 'fish', 'tiger']]
+        weights = get_bm25_weights(corpus)
+        self.assertTrue(weights[0][1] == 0)


Should be assertAlmostEqual (because scores are float), same for next line.

sj29-innovate · 2018-01-11T15:39:16Z

@menshikh-iv I have made the required changes , please let me know if any more changes or add ons are required. Thanks.

menshikh-iv · 2018-01-11T15:55:47Z

I close+open your PR because CI doesn't run correctly, don't worry.

menshikh-iv · 2018-01-11T17:14:56Z

Big thanks @sj29-innovate, good job! Congratz with the first contribution 🥇

sj29-innovate · 2018-01-11T17:17:14Z

@menshikh-iv thank you so much for your constant help , can you please suggest me some other tasks as i really wish to contribute more. thanks!

menshikh-iv · 2018-01-12T04:51:43Z

Yes, please work on #1401 (this is the really critical bug).

* Add wordnet mammal train file for Poincare notebook (piskvorky#1781) * Adds wordnet mammal train file * Adds link to data file in notebook * Fix tox.ini/setup.cfg configuration (piskvorky#1815) * update according to new pytest_benchmark version * update wheel-storage url * use only twine * Fix docstrings for `gensim.utils` (piskvorky#1797) * Add docstrings in numpy-style fromat * fix PEP8 * remove outdated "hack" (smart_open is core dependency right now) * fix docstrings[1] * remove unused internal class * fix docstrings[2] * fix docstrings[3] * fix docstrings[4] * fix docstrings[5] * fix docstrings[6] * fix docstrings[7] * fix docstrings[8] * add missing `pattern` to doc dependencies * fix docstrings[9] * fix docstrings[10] * Fix docstrings for `gensim.models.rpmodel` (piskvorky#1802) * first attempt to convert few lines into numpy-style doc * added parameters in documentation * more documentation * few corrections * show inheritance and undoc members * show special members * example is executable now * link to the paper added, named parameters * fixed doc * fixed doc * fixed whitespaces * fix docstrings & PEP8 * fix docstrings * fix typo * Fix docstrings for `gensim.models.translation_matrix` (piskvorky#1806) * convert Space class doc to numpy style * fix docstrings[1] * fix docstrings[2] * remove useless load * fix docstrings[3] * add missing import * fix docstrings[4] * Add CircleCI for build documentation. Fix piskvorky#1807 (piskvorky#1822) * init config for circle * change * rm cache, install tox distinctly * fix indentation & command * update venv * add pip-cache * add apt packages for latex * rename * enable latex rendering * remove doc building from Travis * store new doc version * Refactor API reference `gensim.topic_coherence`. Fix piskvorky#1669 (piskvorky#1714) * Refactored aggregation * Micro-Fix for aggregation.py, partially refactored direct_confirmation.py * Partially refactored indirect_confirmation_measure * Some additions * Math attempts * add math extension for sphinx * Minor refactoring * Some refactoring for probability_estimation * Beta-strings * Different additions * Minor changes * text_analysis left * Added example for ContextVectorComputer class * probability_estimation 0.9 * beta_version * Added some examples for text_analysis * text_analysis: corrected example for class UsesDictionary * Final additions for text_analysis.py * fix cross-reference problem * fix pep8 * fix aggregation * fix direct_confirmation_measure * fix types in direct_confirmation_measure * partial fix indirect_confirmation_measure * HotFix for probability_estimation and segmentation * Refactoring for probability_estimation * Changes for indirect_confirmation_measure * Fixed segmentation, partly fixed text_analysis * Add Notes for text_analysis * fix di/ind * fix doc examples in probability_estimation * fix probability_estimation * fix segmentation * fix docstring in probability_estimation * partial fix test_analysis * add latex stuff for docs build * doc fix[1] * doc fix[2] * remove apt install from travis (now doc build in circle) * Fix docstrings for `gensim.models.normmodel` (piskvorky#1805) * First edits * changed bow * Added examples * Final commit of the night * Still struggling with docs * Removed examples but still struggling with documentation * fix docstring * fix docstring[2] * Fix docstrings for `gensim.models.logentropy_model` (piskvorky#1803) * improve and correct documentation of models/logentropy_model * include fixes according to comments * implement fixes suggested * associate methods with examples. * fix minor typos * doc fix * Fix docstrings for `gensim.matutils` (piskvorky#1804) * numpy style documentation on matutils.py * doc fix[1] * doc fix[2] * doc fix[3] * doc fix[4] * doc fix[5] * doc fix[6] * Fix formula in `gensim.summarization.bm25`. Fix piskvorky#1828 (piskvorky#1833) * bm25 scoring function updated * Fixes piskvorky#1828 * Fixes piskvorky#1828 * Fixes piskvorky#1828 * Fixes piskvorky#1828 * Fixes piskvorky#1828 * Fixes piskvorky#1828 , Tests added * Fixes piskvorky#1828 , Tests added * Fixes piskvorky#1828 , Tests Added * Fixes piskvorky#1828 , Tests Added * Fixes piskvorky#1828 , Tests Added * Fixes piskvorky#1828 , Tests Added * Fixes piskvorky#1828 * Refactor tests for `gensim.corpora.WikiCorpus`(piskvorky#1821) * minor style refactoring and comment fixes in accordance to PEP8 * Created test data in legitimate compressed XML format (.xml.bz2) for the WikiCorpus class. * Used the same raw data found for other sources (9 articles). * Added Various wiki markup to test the parsing regural expressions * Added test class for the WikiCorpus source. * Following the same inheritance schema as in the source TestWikiCorpus > TestTextCorpus > CorpusTestCase. * Testing methods are overriden where necessary to reflect logic changes. * All existing functionality is tested (account for markup handling, minimum article length etc) * Fix python 3 compatibility for generator next method * code review corrections * Moved WikiCorpus tests from test/test_wikicorpus.py into its class within the test_corpora.py file. * Adapted all old tests to the new class * Current Test class schema ensures that WikiCorpus also passes tests defined in parents * Deleted test_wikicorpus.py since it is now redundant * Discarded the empty input test for the WikiCorpus since an empty file is not legitimate XML * Added 2 more tests * Fix parameter setting for `FastText.train`. Fix piskvorky#1818 (piskvorky#1837) * bm25 scoring function updated * Fixes piskvorky#1828 * Fixes piskvorky#1828 * Fixes piskvorky#1828 * Fixes piskvorky#1828 * Fixes piskvorky#1828 * Fixes piskvorky#1828 , Tests added * Fixes piskvorky#1828 , Tests added * Fixes piskvorky#1828 , Tests Added * Fixes piskvorky#1828 , Tests Added * Fixes piskvorky#1828 , Tests Added * Fixes piskvorky#1828 , Tests Added * Fixes piskvorky#1828 * Function Parameters corrected , Fixes piskvorky#1818 * add missing params + add supercall * Fix positional params used for `gensim.models.CoherenceModel` in `gensim.models.callbacks` (piskvorky#1823) * add keyword params for call to gensim.models.CoherenceModel as positional arguments for coherence and topn were incorrect due to skipping param for keyed_vectors * Fix PEP8

…into smart_open * 'develop' of https://github.com/RaRe-Technologies/gensim: Fix positional params used for `gensim.models.CoherenceModel` in `gensim.models.callbacks` (piskvorky#1823) Fix parameter setting for `FastText.train`. Fix piskvorky#1818 (piskvorky#1837) Refactor tests for `gensim.corpora.WikiCorpus`(piskvorky#1821) Fix formula in `gensim.summarization.bm25`. Fix piskvorky#1828 (piskvorky#1833) Fix docstrings for `gensim.matutils` (piskvorky#1804) Fix docstrings for `gensim.models.logentropy_model` (piskvorky#1803) Fix docstrings for `gensim.models.normmodel` (piskvorky#1805) Refactor API reference `gensim.topic_coherence`. Fix piskvorky#1669 (piskvorky#1714) Add CircleCI for build documentation. Fix piskvorky#1807 (piskvorky#1822) Fix docstrings for `gensim.models.translation_matrix` (piskvorky#1806) Fix docstrings for `gensim.models.rpmodel` (piskvorky#1802) Fix docstrings for `gensim.utils` (piskvorky#1797) Fix tox.ini/setup.cfg configuration (piskvorky#1815) Add wordnet mammal train file for Poincare notebook (piskvorky#1781)

sj29-innovate added 2 commits January 8, 2018 13:26

bm25 scoring function updated

f9803cb

Fixes piskvorky#1828

80dc13d

sj29-innovate added 3 commits January 8, 2018 19:00

Fixes piskvorky#1828

848e5ee

Fixes piskvorky#1828

38e2a0f

Fixes piskvorky#1828

dc702de

Fixes piskvorky#1828

ce290dc

menshikh-iv suggested changes Jan 9, 2018

View reviewed changes

sj29-innovate added 5 commits January 9, 2018 16:36

Fixes piskvorky#1828 , Tests added

d147d0c

Fixes piskvorky#1828 , Tests added

77409dd

Fixes piskvorky#1828 , Tests Added

f1f65e2

Fixes piskvorky#1828 , Tests Added

de436bc

Fixes piskvorky#1828 , Tests Added

17ccbcc

sj29-innovate added 4 commits January 11, 2018 13:45

Fixes piskvorky#1828 , Tests Added

6ec708c

Fixes piskvorky#1828

575a24b

Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim …

ef8e918

…into develop

Merge branch 'develop' into samyak-develop

a2a4c61

menshikh-iv suggested changes Jan 11, 2018

View reviewed changes

Fixes piskvorky#1828

fbf4fad

menshikh-iv closed this Jan 11, 2018

menshikh-iv reopened this Jan 11, 2018

menshikh-iv changed the title ~~Fixes #1828~~ Fix formula in gensim.summarization.bm25. Fix #1828 Jan 11, 2018

menshikh-iv merged commit 4c2c638 into piskvorky:develop Jan 11, 2018

sj29-innovate deleted the samyak-develop branch January 11, 2018 17:18

pyup-bot mentioned this pull request Feb 5, 2018

Scheduled weekly dependency update for week 05 workforce-data-initiative/skills-ml#120

Closed

		@@ -4,7 +4,7 @@
		# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

Uh oh!

Conversation

sj29-innovate commented Jan 8, 2018

Uh oh!

menshikh-iv commented Jan 8, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sj29-innovate commented Jan 8, 2018

Uh oh!

sj29-innovate commented Jan 8, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

menshikh-iv left a comment

Choose a reason for hiding this comment

Uh oh!

menshikh-iv Jan 9, 2018

Choose a reason for hiding this comment

Uh oh!

menshikh-iv Jan 9, 2018

Choose a reason for hiding this comment

Uh oh!

sj29-innovate commented Jan 9, 2018

Uh oh!

menshikh-iv commented Jan 9, 2018

Uh oh!

sj29-innovate commented Jan 9, 2018

Uh oh!

menshikh-iv commented Jan 9, 2018

Uh oh!

sj29-innovate commented Jan 9, 2018

Uh oh!

menshikh-iv commented Jan 11, 2018

Uh oh!

sj29-innovate commented Jan 11, 2018

Uh oh!

menshikh-iv left a comment

Choose a reason for hiding this comment

Uh oh!

menshikh-iv Jan 11, 2018

Choose a reason for hiding this comment

Uh oh!

menshikh-iv Jan 11, 2018

Choose a reason for hiding this comment

Uh oh!

menshikh-iv Jan 11, 2018

Choose a reason for hiding this comment

Uh oh!

menshikh-iv Jan 11, 2018

Choose a reason for hiding this comment

Uh oh!

menshikh-iv Jan 11, 2018

Choose a reason for hiding this comment

Uh oh!

sj29-innovate commented Jan 11, 2018

Uh oh!

menshikh-iv commented Jan 11, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

menshikh-iv commented Jan 11, 2018

Uh oh!

sj29-innovate commented Jan 11, 2018

Uh oh!

menshikh-iv commented Jan 12, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

menshikh-iv commented Jan 8, 2018 •

edited

Loading

sj29-innovate commented Jan 8, 2018 •

edited

Loading

menshikh-iv commented Jan 11, 2018 •

edited

Loading