Skip to content

[MRG] Improve segment_wiki script#1707

Merged
menshikh-iv merged 8 commits into
developfrom
fix_segment_wiki
Nov 11, 2017
Merged

[MRG] Improve segment_wiki script#1707
menshikh-iv merged 8 commits into
developfrom
fix_segment_wiki

Conversation

@piskvorky
Copy link
Copy Markdown
Owner

@piskvorky piskvorky commented Nov 10, 2017

This PR cleans up the documentation of the segment_wiki.py script, continuing from #1694.

I also added a new -w option to the script, letting users control the number of workers.

I also fixed some broken formatting in the release notes in the section on segment_wiki.

@piskvorky piskvorky changed the title Improve segment_wiki script [WIP] Improve segment_wiki script Nov 10, 2017


def segment_all_articles(file_path, min_article_character=200):
def segment_all_articles(file_path, min_article_character=200, workers=None):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why workers=None by default? If default is number of cores - 1, this should be workers=multiprocessing.cpu_count() - 1 or something equal (same for all workers=...)

Copy link
Copy Markdown
Owner Author

@piskvorky piskvorky Nov 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is (see the docs, updated).

Copy link
Copy Markdown
Contributor

@menshikh-iv menshikh-iv Nov 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not substitute the real default value (instead of None), when you have it? Why needed None here?

Copy link
Copy Markdown
Owner Author

@piskvorky piskvorky Nov 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find it more explicit -- a person viewing the signature would see a number, and perhaps not realize this number is "variable" (will be different on a different system).

In other words, I find it a little "less surprising" when only constants are the values in default parameters. Anything dynamic/mutable is handled by None instead.

total_sections += len(sections)
yield (article_title, sections)
logger.info(
"finished processing %i articles with %i sections (skipped %i redirects, %i stubs, %i ignored namespaces)",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is 'sections' needed here, what is the good of it?

Copy link
Copy Markdown
Owner Author

@piskvorky piskvorky Nov 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sanity checking.

The rule of thumb for logging is: always display statistics and concrete examples, to make debugging and "sanity eyeballing" easier for humans.

Plus, they are interesting and note-worthy numbers in itself (how many sections are there, in the entire Wikipedia?).

Comment thread gensim/scripts/segment_wiki.py Outdated
parser.add_argument('-o', '--output', help='Path to output file (stdout if not specified)')
parser.add_argument(
'-w', '--workers',
help='Number of parallel workers for multi-core systems (default: %i)' % default_workers,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'(default: %i)' % default workers -> (default: %(default)s)

Copy link
Copy Markdown
Owner Author

@piskvorky piskvorky Nov 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? That doesn't sound right, it changes the semantics (str vs int, different var name).

Or is that some automatic feature of argparse?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's automatic substitution default value, doc

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, I didn't know about that. Changed.

@piskvorky piskvorky changed the title [WIP] Improve segment_wiki script [MRG] Improve segment_wiki script Nov 11, 2017

"""
Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump (typical filename
is <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2 or <LANG>wiki-latest-pages-articles.xml.bz2),
Copy link
Copy Markdown
Contributor

@menshikh-iv menshikh-iv Nov 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please return concrete "patterns" for filenames, Wikipedia provides many archives, users very often use the incorrect archive and don't understand "why this doesn't work".

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.



def segment_all_articles(file_path, min_article_character=200):
def segment_all_articles(file_path, min_article_character=200, workers=None):
Copy link
Copy Markdown
Contributor

@menshikh-iv menshikh-iv Nov 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not substitute the real default value (instead of None), when you have it? Why needed None here?

@menshikh-iv menshikh-iv merged commit a21e7df into develop Nov 11, 2017
@menshikh-iv menshikh-iv deleted the fix_segment_wiki branch November 11, 2017 10:01
VaiyeBe pushed a commit to VaiyeBe/gensim that referenced this pull request Nov 26, 2017
* improve segment_wiki docstring

* add segment_wiki option to control the number of multicore workers

* improve segment_wiki logging

* update segment_wiki timing

* clarify segment_wiki docs

* fix argparse defaults in segment_wiki

* log more detailed progress in segment_wiki

* example of MediaWiki format
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants