Add article interlinks to the output of gensim.scripts.segment_wiki. Fix #1712 #1839
Merged
* New output format is (str, list of (str, str), list of str), reflecting the structure (title, [(section_heading, section_content), ...], [interlink, ...])
* `filter_wiki` in WikiCorpus will not promote uncaught markup to plain text, as this would give up information that is valuable for interlink discovery
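To make the new tuple shape concrete, here is a minimal sketch of turning one (title, sections, interlinks) item into a JSON line. The helper name `article_to_json` is hypothetical, and the `"interlinks"` key is an assumption based on the naming discussed in review; this is not the script's actual code.

```python
import json


def article_to_json(title, sections, interlinks):
    """Serialize one article tuple to a JSON string (illustrative helper only)."""
    output = {
        "title": title,
        "section_titles": [],
        "section_texts": [],
        "interlinks": interlinks,  # assumed key name, not confirmed by the script
    }
    # sections is a list of (section_heading, section_content) pairs
    for heading, text in sections:
        output["section_titles"].append(heading)
        output["section_texts"].append(text)
    return json.dumps(output)


line = article_to_json(
    "Anarchism",
    [("Introduction", "Anarchism is a political philosophy."), ("History", "Some text.")],
    ["Political philosophy"],
)
```

Each article becomes one self-contained JSON object, so a consumer can read the output file line by line.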
menshikh-iv
suggested changes
Jan 15, 2018
Contributor
Great start @steremma!
It would be really nice if you added several simple tests for the segment_wiki script (you already have a small wiki dump as test data).
- for idx, (article_title, article_sections) in enumerate(article_stream):
-     output_data = {"title": article_title, "section_titles": [], "section_texts": []}
+ for idx, (article_title, article_sections, article_interlinks) in enumerate(article_stream):
+     output_data = {"title": article_title,
Contributor
please use hanging indents (instead of vertical)
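For readers unfamiliar with the two styles, a small sketch of the difference follows. The sample values are made up; only the dictionary layout matters.

```python
article_title = "Anarchism"  # made-up sample values
article_interlinks = ["Political philosophy"]

# Vertical alignment: continuation lines line up with the opening brace.
vertical = {"title": article_title,
            "section_titles": [],
            "section_texts": [],
            "interlinks": article_interlinks}

# Hanging indent: nothing follows the opening brace, and each
# continuation line is indented one level.
hanging = {
    "title": article_title,
    "section_titles": [],
    "section_texts": [],
    "interlinks": article_interlinks,
}
```

Both forms build the same dictionary; the hanging form keeps diffs smaller when the variable name or the first key changes.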
logger.info("finished running %s", sys.argv[0])
print("-----Now checking output--------\n\n\n")
Contributor
Why? This isn't needed here.
output_data = {"title": article_title,
               "section_titles": [],
               "section_texts": [],
               "section_interlinks": article_interlinks
Contributor
If you don't split interlinks by section, this should be named "interlinks" instead of "section_interlinks".
…re disregarded
* Due to preprocessing in `filter_wiki`, interlinks containing alternative names had one of the 2 `[` and `]` characters removed. The regex now takes that into account.
* Initiate unit testing for all scripts.
* Check for expected len given article filtering (namespace, size in characters and redirections).
* Check for yielded title, section headings and texts as well as interlinks yielded from generator.
* Check that the same is correctly persisted in JSON.
* Fix PEP 8
* Refactored filtering functions in `wikicorpus.py` so that uncaught markup can be optionally promoted to plain text
* Interlink extraction logic moved to `wikicorpus.py`
* Unit tests modified accordingly
napsternxg
reviewed
Jan 19, 2018
Find all interlinks to other articles in the dump. `raw` is either unicode
or utf-8 encoded string.
"""
interlink_regex_capture = r"\[{1,2}(.*?)\]{1,2}"
Contributor
I think this regex can be compiled outside the function.
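A minimal sketch of what that suggestion could look like, using the pattern quoted in the diff. The names `RE_INTERLINK` and `extract_link_texts` are hypothetical and chosen only for this illustration.

```python
import re

# Compiled once at import time instead of being rebuilt on every call.
RE_INTERLINK = re.compile(r"\[{1,2}(.*?)\]{1,2}")


def extract_link_texts(raw):
    """Return the text captured between each [..] or [[..]] pair (illustrative only)."""
    return RE_INTERLINK.findall(raw)


sample = "See [[Computer science|CS]] and [Python]."
found = extract_link_texts(sample)
```

The `re` module caches recently used patterns, so the gain is mostly in readability and in making the pattern reusable, but it also avoids the cache lookup inside a hot loop.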
…ual interlink text
* Used a boolean argument with a default value in `filter_wiki`. The default value keeps the old functionality so that existing code does not break
* Overriding the default argument causes interlinks to not be simplified and lets `find_interlinks` create the mappings
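The interplay between the default argument and the mapping can be sketched with toy stand-ins. Everything here (`toy_filter`, `toy_find_interlinks`, the `RE_LINK` pattern, and the argument name `simplify_links`) is an assumption made for illustration and is not gensim's implementation.

```python
import re

# Matches [[target]] or [[target|label]]; an assumed pattern, not gensim's.
RE_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def toy_filter(text, simplify_links=True):
    """Toy markup filter; the default value preserves the old behaviour."""
    if simplify_links:
        # Old behaviour: replace each link with its visible text.
        return RE_LINK.sub(lambda m: m.group(2) or m.group(1), text)
    # New opt-in behaviour: leave link markup intact for later extraction.
    return text


def toy_find_interlinks(text):
    """Map each linked article title to the text displayed for the link."""
    return {m.group(1): (m.group(2) or m.group(1)) for m in RE_LINK.finditer(text)}


text = "[[Computer science|CS]] uses [[Python]]"
simplified = toy_filter(text)
mapping = toy_find_interlinks(toy_filter(text, simplify_links=False))
```

Callers that never pass the new argument keep getting simplified text, while the interlink extraction path opts out explicitly and can still see the markup.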
Contributor
@steremma please don't forget to
…mand line argument
…into interlinks
* Kept documentation improvements from upstream
* Kept interlink support and updated signatures from my branch
* Added documentation for my extra arguments in the correct format
Contributor
Looks great, thank you @steremma!
Contributor
Author
👍
sj29-innovate pushed a commit to sj29-innovate/gensim that referenced this pull request on Feb 21, 2018
…Fix piskvorky#1712 (piskvorky#1839)
* promoting the markup gives up information needed to find the interlinks
* Add interlinks to the output of `segment_wiki`
* New output format is (str, list of (str, str), list of str), reflecting the structure (title, [(section_heading, section_content), ...], [interlink, ...])
* `filter_wiki` in WikiCorpus will not promote uncaught markup to plain text as this will give up valuable information for the interlink discovery
* Fixed PEP 8
* Refactoring indentation and variable names
* Removed debugging code from script
* Fixed a bug where interlinks with a description or multiple names were disregarded
* Due to preprocessing in `filter_wiki`, interlinks containing alternative names had one of the 2 `[` and `]` characters removed. The regex now takes that into account.
* Now stripping whitespace off section titles
* Unit test `gensim.scripts.segment_wiki`
* Initiate unit testing for all scripts.
* Check for expected len given article filtering (namespace, size in characters and redirections).
* Check for yielded title, section headings and texts as well as interlinks yielded from generator.
* Check that the same is correctly persisted in JSON.
* Fix PEP 8
* Fix Python 3.5 compatibility
* Section text now completely clean from wiki markup
* Refactored filtering functions in `wikicorpus.py` so that uncaught markup can be optionally promoted to plain text
* Interlink extraction logic moved to `wikicorpus.py`
* Unit tests modified accordingly
* Added extra logging info to troubleshoot weird Travis behavior
* Fix PEP 8
* pin workers for segment_and_write_all_articles
* Get rid of debugging stuff
* Get rid of global logger
* Interlinks are now a mapping from the linked article's title to the actual interlink text
* Used a boolean argument with a default value in `filter_wiki`. The default value keeps the old functionality so that existing code does not break
* Overriding the default argument causes interlinks to not be simplified and lets `find_interlinks` create the mappings
* Moved regex outside function
* Interlink extraction is now optional and controlled with the `-i` command line argument
* PEP 8 long lines
* made scripts tests aware of the optional interlinks argument
* Updated script help output for interlinks
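An opt-in flag like the `-i` argument mentioned in the commit list could be wired up roughly as follows. Only the short flag `-i` comes from the commit message; the long option name, the `store_true` action, and the `build_parser` helper are assumptions for this sketch.

```python
import argparse


def build_parser():
    """Build a toy parser with an opt-in interlinks flag (illustrative only)."""
    parser = argparse.ArgumentParser(description="Toy segment_wiki-style CLI")
    parser.add_argument(
        "-i", "--include-interlinks",  # long name is assumed, not from the source
        action="store_true",
        help="also extract interlinks; off unless the flag is given",
    )
    return parser


with_links = build_parser().parse_args(["-i"])
without_links = build_parser().parse_args([])
```

Keeping the flag off by default means existing invocations of the script produce the same output as before.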
Interlinks
This PR adds the feature requested in #1712
The `segment_wiki` script now includes the interlinks (links to other articles in the dump) found in each article it processes.
Implementation
The interlink markup is parsed from the filtered text rather than the raw text, to avoid duplicating code and logic (similar markup would otherwise need to be removed again, which `filter_wiki` already does). For this to work, `filter_wiki` must not promote uncaught markup to plain text. The usefulness of that operation was already doubted (see the comment in the old version), so I removed the promotion altogether.
Removing this line was important because:
* A phrase like "Computer Science", which has a meaning as a phrase, loses that meaning if it is promoted to plain text (2 tokens).
* `[[` and `]]` are the markup used for interlinks, so promoting them loses this information.
Since the output of the script and its helper functions changed, I modified the documentation (source comments) accordingly.