Fix formula in gensim.summarization.bm25. Fix #1828#1833
Conversation
|
Hey @sj29-innovate, what was it #1832 #1831 #1830? |
|
I think you are correct , i should probably pre calculate all document lengths. |
|
Please ignore #1832 #1831 #1830 , I was trying to figure out my way through and due to build errors i closed those PR's instead of committing new changes there itself. |
menshikh-iv
left a comment
There was a problem hiding this comment.
Thanks @sj29-innovate, please make last fixes and I'll merge your PR!
| """Calculates frequencies of terms in documents and in corpus. Also computes inverse document frequencies.""" | ||
| for document in self.corpus: | ||
| frequencies = {} | ||
| (self.doc_len).append(len(document)) |
There was a problem hiding this comment.
Useless ( ), please remove.
| idf = self.idf[word] if self.idf[word] >= 0 else EPSILON * average_idf | ||
| score += (idf * self.f[index][word] * (PARAM_K1 + 1) | ||
| / (self.f[index][word] + PARAM_K1 * (1 - PARAM_B + PARAM_B * len(document) / self.avgdl))) | ||
| / (self.f[index][word] + PARAM_K1 * (1 - PARAM_B + PARAM_B * self.doc_len[index] / self.avgdl))) |
There was a problem hiding this comment.
please add several simple tests for BM25
|
@menshikh-iv Can you please guide me a bit towards writing unit tests and also what corpus should i be using for the tests. Thanks a lot. |
|
@sj29-innovate |
|
Should I be testing things like document should show maximum matching with itself and other such conditions ? Thanks for your help. |
|
@sj29-innovate sounds good. |
|
@menshikh-iv ci/circleci tests shows command timed out . At some point in the tests it is asking for username , i think it is getting stuck there.Can you please help me with this. |
|
@sj29-innovate please merge fresh "develop" to your PR (CircleCI don't see needed config in your branch -> build from "inner-default-python" config), this should fix your problem! |
|
@menshikh-iv I managed to resolve circleci failing tests . Requesting your feedback for the same. Thanks! |
menshikh-iv
left a comment
There was a problem hiding this comment.
Good work @sj29-innovate, please make last changes and I'll merge your PR!
| @@ -4,7 +4,7 @@ | |||
| # Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html | |||
There was a problem hiding this comment.
Please remove `Keras-2.1.2-py2.7.egg from PR (comment here, because I can't add comment to binary file)
| for index, doc_weights in enumerate(weights): | ||
| expected = max(doc_weights) | ||
| predicted = doc_weights[index] | ||
| self.assertEqual(expected, predicted) |
There was a problem hiding this comment.
Should be assertAlmostEqual (because scores are float)
| weights = get_bm25_weights(common_texts) | ||
| for doc_weights in weights: | ||
| for weight in doc_weights: | ||
| self.assertTrue(weight >= 0) |
| """ A document should always get the same weight when matched with a particular document """ | ||
| corpus = [['cat', 'dog', 'mouse'], ['cat', 'lion'], ['cat', 'lion']] | ||
| weights = get_bm25_weights(corpus) | ||
| self.assertEqual(weights[0][1], weights[0][2]) |
There was a problem hiding this comment.
Should be assertAlmostEqual (because scores are float)
| """ Two disjoint documents should have zero matching""" | ||
| corpus = [['cat', 'dog', 'lion'], ['goat', 'fish', 'tiger']] | ||
| weights = get_bm25_weights(corpus) | ||
| self.assertTrue(weights[0][1] == 0) |
There was a problem hiding this comment.
Should be assertAlmostEqual (because scores are float), same for next line.
|
@menshikh-iv I have made the required changes , please let me know if any more changes or add ons are required. Thanks. |
|
I close+open your PR because CI doesn't run correctly, don't worry. |
gensim.summarization.bm25. Fix #1828
|
Big thanks @sj29-innovate, good job! Congratz with the first contribution 🥇 |
|
@menshikh-iv thank you so much for your constant help , can you please suggest me some other tasks as i really wish to contribute more. thanks! |
|
Yes, please work on #1401 (this is the really critical bug). |
* Add wordnet mammal train file for Poincare notebook (piskvorky#1781) * Adds wordnet mammal train file * Adds link to data file in notebook * Fix tox.ini/setup.cfg configuration (piskvorky#1815) * update according to new pytest_benchmark version * update wheel-storage url * use only twine * Fix docstrings for `gensim.utils` (piskvorky#1797) * Add docstrings in numpy-style fromat * fix PEP8 * remove outdated "hack" (smart_open is core dependency right now) * fix docstrings[1] * remove unused internal class * fix docstrings[2] * fix docstrings[3] * fix docstrings[4] * fix docstrings[5] * fix docstrings[6] * fix docstrings[7] * fix docstrings[8] * add missing `pattern` to doc dependencies * fix docstrings[9] * fix docstrings[10] * Fix docstrings for `gensim.models.rpmodel` (piskvorky#1802) * first attempt to convert few lines into numpy-style doc * added parameters in documentation * more documentation * few corrections * show inheritance and undoc members * show special members * example is executable now * link to the paper added, named parameters * fixed doc * fixed doc * fixed whitespaces * fix docstrings & PEP8 * fix docstrings * fix typo * Fix docstrings for `gensim.models.translation_matrix` (piskvorky#1806) * convert Space class doc to numpy style * fix docstrings[1] * fix docstrings[2] * remove useless load * fix docstrings[3] * add missing import * fix docstrings[4] * Add CircleCI for build documentation. Fix piskvorky#1807 (piskvorky#1822) * init config for circle * change * rm cache, install tox distinctly * fix indentation & command * update venv * add pip-cache * add apt packages for latex * rename * enable latex rendering * remove doc building from Travis * store new doc version * Refactor API reference `gensim.topic_coherence`. Fix piskvorky#1669 (piskvorky#1714) * Refactored aggregation * Micro-Fix for aggregation.py, partially refactored direct_confirmation.py * Partially refactored indirect_confirmation_measure * Some additions * Math attempts * add math extension for sphinx * Minor refactoring * Some refactoring for probability_estimation * Beta-strings * Different additions * Minor changes * text_analysis left * Added example for ContextVectorComputer class * probability_estimation 0.9 * beta_version * Added some examples for text_analysis * text_analysis: corrected example for class UsesDictionary * Final additions for text_analysis.py * fix cross-reference problem * fix pep8 * fix aggregation * fix direct_confirmation_measure * fix types in direct_confirmation_measure * partial fix indirect_confirmation_measure * HotFix for probability_estimation and segmentation * Refactoring for probability_estimation * Changes for indirect_confirmation_measure * Fixed segmentation, partly fixed text_analysis * Add Notes for text_analysis * fix di/ind * fix doc examples in probability_estimation * fix probability_estimation * fix segmentation * fix docstring in probability_estimation * partial fix test_analysis * add latex stuff for docs build * doc fix[1] * doc fix[2] * remove apt install from travis (now doc build in circle) * Fix docstrings for `gensim.models.normmodel` (piskvorky#1805) * First edits * changed bow * Added examples * Final commit of the night * Still struggling with docs * Removed examples but still struggling with documentation * fix docstring * fix docstring[2] * Fix docstrings for `gensim.models.logentropy_model` (piskvorky#1803) * improve and correct documentation of models/logentropy_model * include fixes according to comments * implement fixes suggested * associate methods with examples. * fix minor typos * doc fix * Fix docstrings for `gensim.matutils` (piskvorky#1804) * numpy style documentation on matutils.py * doc fix[1] * doc fix[2] * doc fix[3] * doc fix[4] * doc fix[5] * doc fix[6] * Fix formula in `gensim.summarization.bm25`. Fix piskvorky#1828 (piskvorky#1833) * bm25 scoring function updated * Fixes piskvorky#1828 * Fixes piskvorky#1828 * Fixes piskvorky#1828 * Fixes piskvorky#1828 * Fixes piskvorky#1828 * Fixes piskvorky#1828 , Tests added * Fixes piskvorky#1828 , Tests added * Fixes piskvorky#1828 , Tests Added * Fixes piskvorky#1828 , Tests Added * Fixes piskvorky#1828 , Tests Added * Fixes piskvorky#1828 , Tests Added * Fixes piskvorky#1828 * Refactor tests for `gensim.corpora.WikiCorpus`(piskvorky#1821) * minor style refactoring and comment fixes in accordance to PEP8 * Created test data in legitimate compressed XML format (.xml.bz2) for the WikiCorpus class. * Used the same raw data found for other sources (9 articles). * Added Various wiki markup to test the parsing regural expressions * Added test class for the WikiCorpus source. * Following the same inheritance schema as in the source TestWikiCorpus > TestTextCorpus > CorpusTestCase. * Testing methods are overriden where necessary to reflect logic changes. * All existing functionality is tested (account for markup handling, minimum article length etc) * Fix python 3 compatibility for generator next method * code review corrections * Moved WikiCorpus tests from test/test_wikicorpus.py into its class within the test_corpora.py file. * Adapted all old tests to the new class * Current Test class schema ensures that WikiCorpus also passes tests defined in parents * Deleted test_wikicorpus.py since it is now redundant * Discarded the empty input test for the WikiCorpus since an empty file is not legitimate XML * Added 2 more tests * Fix parameter setting for `FastText.train`. Fix piskvorky#1818 (piskvorky#1837) * bm25 scoring function updated * Fixes piskvorky#1828 * Fixes piskvorky#1828 * Fixes piskvorky#1828 * Fixes piskvorky#1828 * Fixes piskvorky#1828 * Fixes piskvorky#1828 , Tests added * Fixes piskvorky#1828 , Tests added * Fixes piskvorky#1828 , Tests Added * Fixes piskvorky#1828 , Tests Added * Fixes piskvorky#1828 , Tests Added * Fixes piskvorky#1828 , Tests Added * Fixes piskvorky#1828 * Function Parameters corrected , Fixes piskvorky#1818 * add missing params + add supercall * Fix positional params used for `gensim.models.CoherenceModel` in `gensim.models.callbacks` (piskvorky#1823) * add keyword params for call to gensim.models.CoherenceModel as positional arguments for coherence and topn were incorrect due to skipping param for keyed_vectors * Fix PEP8
…into smart_open * 'develop' of https://github.com/RaRe-Technologies/gensim: Fix positional params used for `gensim.models.CoherenceModel` in `gensim.models.callbacks` (piskvorky#1823) Fix parameter setting for `FastText.train`. Fix piskvorky#1818 (piskvorky#1837) Refactor tests for `gensim.corpora.WikiCorpus`(piskvorky#1821) Fix formula in `gensim.summarization.bm25`. Fix piskvorky#1828 (piskvorky#1833) Fix docstrings for `gensim.matutils` (piskvorky#1804) Fix docstrings for `gensim.models.logentropy_model` (piskvorky#1803) Fix docstrings for `gensim.models.normmodel` (piskvorky#1805) Refactor API reference `gensim.topic_coherence`. Fix piskvorky#1669 (piskvorky#1714) Add CircleCI for build documentation. Fix piskvorky#1807 (piskvorky#1822) Fix docstrings for `gensim.models.translation_matrix` (piskvorky#1806) Fix docstrings for `gensim.models.rpmodel` (piskvorky#1802) Fix docstrings for `gensim.utils` (piskvorky#1797) Fix tox.ini/setup.cfg configuration (piskvorky#1815) Add wordnet mammal train file for Poincare notebook (piskvorky#1781)
No description provided.