Fix segment-wiki script #1694
…enization, more descriptive field names, stdout support (as default option)
| Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump and extract sections of pages from it | ||
| and save to json-line format. | ||
| Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump (typical filename | ||
| is <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2), |
I'd add a link to the actual place to download those, because it's not obvious.
For example, the English Wiki dump is here: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
| tokenizer). The package is available at https://github.com/clips/pattern . | ||
| 'title' (str) - title of article, | ||
| 'section_titles' (list) - list of titles of sections, | ||
| 'section_texts' (list) - list of content from sections. |
I'd prefer to include a concrete hands-on example, something like this:
Process a raw Wikipedia dump (XML.bz2 format, for example https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 for the English Wikipedia) and extract all articles and their sections as plain text::

    python -m gensim.scripts.segment_wiki -f enwiki-20171001-pages-articles.xml.bz2 -o enwiki-20171001-pages-articles.json.gz

The output format of the parsed plain text Wikipedia is json-lines: one article per line, serialized into JSON. Here's an example of how to work with it from Python::

    import json
    from smart_open import smart_open

    # iterate over the plain text file we just created
    for line in smart_open('enwiki-20171001-pages-articles.json.gz'):
        # decode JSON into a Python object
        article = json.loads(line)

        # each article has "title", "section_titles" and "section_texts" fields
        print("Article title: %s" % article['title'])
        for section_title, section_text in zip(article['section_titles'], article['section_texts']):
            print("Section title: %s" % section_title)
            print("Section text: %s" % section_text)
| num_total_tokens += len(utils.lemmatize(section_content)) | ||
| else: | ||
| num_total_tokens += len(tokenize(section_content)) | ||
| if num_total_tokens < ARTICLE_MIN_WORDS or \ |
Including redirects and stubs is a bad idea. That's rarely (never?) what people want out of Wikipedia dumps.
We want to keep only meaningful articles, for example those with at least 500 plain-text characters (~1 paragraph).
I don't agree about short articles. We provide the parsed Wikipedia dump "as-is", and short articles can be useful in special cases (they are also easy to filter out later if needed). For this reason, I removed this part.
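The "easy to filter later" point can be sketched as follows. This is a minimal illustration, assuming the json-lines output described above (one JSON object per line with `title`, `section_titles` and `section_texts` fields). The helper name `filter_short_articles` and its `min_chars` parameter are hypothetical and not part of gensim.

```python
import json


def filter_short_articles(lines, min_chars=500):
    """Yield parsed articles whose total stripped section text has at least min_chars characters.

    `lines` is any iterable of JSON strings, one article per line.
    Hypothetical helper for illustration only; not part of gensim.
    """
    for line in lines:
        article = json.loads(line)
        total_chars = sum(len(text.strip()) for text in article['section_texts'])
        if total_chars >= min_chars:
            yield article


# Tiny in-memory sample standing in for the real dump output.
sample = [
    json.dumps({'title': 'Stub', 'section_titles': ['Introduction'], 'section_texts': ['Too short.']}),
    json.dumps({'title': 'Long', 'section_titles': ['Introduction'], 'section_texts': ['x' * 600]}),
]
kept_titles = [article['title'] for article in filter_short_articles(sample)]
```

With the real output file, the same generator could be fed the lines of the opened dump instead of `sample`.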
But we do need to filter out trash (like redirects); I'll add a fix for this.
Sounds good 👍
Stubs are not really articles, though; most of their text is something like "this article is a stub, help Wikipedia by expanding it". That's not terribly useful, and it can mess up corpus statistics for people who are unaware of it.
| # article redirects are pruned here | ||
| if any(article_title.startswith(ignore + ':') for ignore in IGNORED_NAMESPACES): | ||
| if any(article_title.startswith(ignore + ':') for ignore in IGNORED_NAMESPACES) \ | ||
| or len(sections) == 0 \ |
| continue | ||
| if len(sections) == 0 or sections[0][1].lstrip().lower().startswith("#redirect"): # filter redirect | ||
| continue | ||
| if sum(len(body.strip()) for (_, body) in sections) < 250: # filter very short articles (thrash) |
thrash => trash ; but it's more stubs than trash.
The constant (250) should be configurable (min_article_characters?), not hardwired like this.
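One way the suggestion above might look is sketched below. It assumes, as in the diff, that `sections` is a list of `(section_title, section_text)` pairs. The function name `keep_article` is hypothetical, `min_article_characters` is the parameter name floated in this comment, and the default of 250 simply mirrors the hardwired constant; none of this is the script's actual code.

```python
def keep_article(sections, min_article_characters=250):
    """Return True if an article should be kept in the output.

    `sections` is a list of (section_title, section_text) pairs.
    Hypothetical helper; the threshold is a parameter instead of a hardwired constant.
    """
    # Articles without any sections carry no text at all.
    if not sections:
        return False

    # Redirect pages start with "#redirect" (case-insensitive) in their first section.
    if sections[0][1].lstrip().lower().startswith("#redirect"):
        return False

    # Drop very short articles (mostly stubs), measured in stripped characters.
    total_chars = sum(len(body.strip()) for _, body in sections)
    return total_chars >= min_article_characters
```

A caller could then expose the threshold as a command-line option and pass it through, so users who want stubs can set it to 0.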
What's done:
ts,sc);-hoption);