[MRG] Improve segment_wiki script#1707
Conversation
|
|
||
|
|
||
| def segment_all_articles(file_path, min_article_character=200): | ||
| def segment_all_articles(file_path, min_article_character=200, workers=None): |
There was a problem hiding this comment.
Why workers=None by default? If default is number of cores - 1, this should be workers=multiprocessing.cpu_count() - 1 or something equal (same for all workers=...)
There was a problem hiding this comment.
It is (see the docs, updated).
There was a problem hiding this comment.
Why not substitute the real default value (instead of None), when you have it? Why needed None here?
There was a problem hiding this comment.
I find it more explicit -- a person viewing the signature would see a number, and perhaps not realize this number is "variable" (will be different on a different system).
In other words, I find it a little "less surprising" when only constants are the values in default parameters. Anything dynamic/mutable is handled by None instead.
| total_sections += len(sections) | ||
| yield (article_title, sections) | ||
| logger.info( | ||
| "finished processing %i articles with %i sections (skipped %i redirects, %i stubs, %i ignored namespaces)", |
There was a problem hiding this comment.
Why is 'sections' needed here, what is the good of it?
There was a problem hiding this comment.
Sanity checking.
The rule of thumb for logging is: always display statistics and concrete examples, to make debugging and "sanity eyeballing" easier for humans.
Plus, they are interesting and note-worthy numbers in itself (how many sections are there, in the entire Wikipedia?).
| parser.add_argument('-o', '--output', help='Path to output file (stdout if not specified)') | ||
| parser.add_argument( | ||
| '-w', '--workers', | ||
| help='Number of parallel workers for multi-core systems (default: %i)' % default_workers, |
There was a problem hiding this comment.
'(default: %i)' % default workers -> (default: %(default)s)
There was a problem hiding this comment.
Why? That doesn't sound right, it changes the semantics (str vs int, different var name).
Or is that some automatic feature of argparse?
There was a problem hiding this comment.
Yes, it's automatic substitution default value, doc
There was a problem hiding this comment.
Cool, I didn't know about that. Changed.
|
|
||
| """ | ||
| Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump (typical filename | ||
| is <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2), |
There was a problem hiding this comment.
Please return concrete "patterns" for filenames, Wikipedia provides many archives, users very often use the incorrect archive and don't understand "why this doesn't work".
|
|
||
|
|
||
| def segment_all_articles(file_path, min_article_character=200): | ||
| def segment_all_articles(file_path, min_article_character=200, workers=None): |
There was a problem hiding this comment.
Why not substitute the real default value (instead of None), when you have it? Why needed None here?
* improve segment_wiki docstring * add segment_wiki option to control the number of multicore workers * improve segment_wiki logging * update segment_wiki timing * clarify segment_wiki docs * fix argparse defaults in segment_wiki * log more detailed progress in segment_wiki * example of MediaWiki format
This PR cleans up the documentation of the
segment_wiki.pyscript, continuing from #1694.I also added a new
-woption to the script, letting users control the number of workers.I also fixed some broken formatting in the release notes in the section on segment_wiki.