Allow name_only option gensim downloader api #2143

Merged
menshikh-iv merged 35 commits into piskvorky:develop from aneesh-joshi:name_only_develop on Aug 3, 2018

@aneesh-joshi aneesh-joshi commented Jul 31, 2018

Currently, to get the exact names of the models or corpora, a user has to either:

  1. run gensim.downloader.info() and look through the huge JSON dump to find the exact names
  2. go to the gensim-data website and check

When using gensim-data, I often forget the exact key.
"Was it 'glove-wiki-gigaword' or 'glove-gigaword-wiki'?"

It would be very helpful if a user could, in the terminal or otherwise, type:
gensim.downloader.info(name_only=True) or
python -m gensim.downloader --info_name_only
and get the following output:

{
    "corpora": [
        "semeval-2016-2017-task3-subtaskBC",
        "semeval-2016-2017-task3-subtaskA-unannotated",
        "patent-2017",
        "quora-duplicate-questions",
        "wiki-english-20171001",
        "text8",
        "fake-news",
        "20-newsgroups",
        "__testing_matrix-synopsis",
        "__testing_multipart-matrix-synopsis"
    ],
    "models": [
        "fasttext-wiki-news-subwords-300",
        "conceptnet-numberbatch-17-06-300",
        "word2vec-ruscorpora-300",
        "word2vec-google-news-300",
        "glove-wiki-gigaword-50",
        "glove-wiki-gigaword-100",
        "glove-wiki-gigaword-200",
        "glove-wiki-gigaword-300",
        "glove-twitter-25",
        "glove-twitter-50",
        "glove-twitter-100",
        "glove-twitter-200",
        "__testing_word2vec-matrix-synopsis"
    ]
}
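
For illustration, the name-only view requested above amounts to keeping the keys of each category and dropping the metadata. The following is a minimal sketch, assuming the full info is a dict with "corpora" and "models" keys that map names to metadata. `FULL_INFO` and `names_only` are hypothetical names, and the metadata values are placeholders; this is not gensim's actual data or implementation.

```python
import json

# Hypothetical stand-in for the full metadata dump (placeholder values only).
FULL_INFO = {
    "corpora": {
        "text8": {"description": "placeholder"},
        "20-newsgroups": {"description": "placeholder"},
    },
    "models": {
        "glove-twitter-25": {"description": "placeholder"},
    },
}


def names_only(information):
    """Return just the dataset names, grouped by category."""
    return {
        "corpora": list(information["corpora"].keys()),
        "models": list(information["models"].keys()),
    }


if __name__ == "__main__":
    # Prints a short JSON document shaped like the output shown above.
    print(json.dumps(names_only(FULL_INFO), indent=4))
```

With a structure like this, the exact key can be checked at a glance instead of searching the full dump.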

Notes:
The downloader.py on the current develop branch already fails its doctests, independent of my changes.

Comment thread gensim/downloader.py Outdated
Also, this API available via CLI::

python -m gensim.downloader --info <dataname> # same as api.info(dataname)
python -m gensim.downloader --info_name_only # same as api.info(name_only=True)

I think it would be better to make this a parameter of the --info flag (instead of adding a new --info_* flag), e.g. --info name

@aneesh-joshi

Changes made, @menshikh-iv

Comment thread gensim/downloader.py Outdated
Also, this API available via CLI::

python -m gensim.downloader --info <dataname> # same as api.info(dataname)
python -m gensim.downloader --info name_only # same as api.info(name_only=True)

--info name, please :) (but keep the name_only parameter for CLI)
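
The review suggestion (one `--info` flag whose value selects the mode) can be sketched with argparse as follows. This is a hedged illustration, not the merged code: `FULL_INFORMATION`, `build_parser`, and `describe_request` are hypothetical names, and the function returns a label rather than calling the downloader.

```python
import argparse

# Hypothetical sentinel meaning "--info was given without a value".
FULL_INFORMATION = "__full_information__"


def build_parser():
    parser = argparse.ArgumentParser(prog="downloader_sketch")
    # nargs="?" makes the value optional; const is used when the flag is bare.
    parser.add_argument(
        "-i", "--info",
        nargs="?", const=FULL_INFORMATION, default=None,
        help="dataset name, 'name' for names only, or no value for everything",
    )
    return parser


def describe_request(argv):
    """Map the parsed --info value to the kind of output requested."""
    args = build_parser().parse_args(argv)
    if args.info is None:
        return "no info requested"
    if args.info == "name":
        return "names only"  # would correspond to info(name_only=True)
    if args.info == FULL_INFORMATION:
        return "full info"  # would correspond to info()
    return "info for " + args.info  # would correspond to info(<dataname>)


if __name__ == "__main__":
    print(describe_request(["--info", "name"]))
```

One trade-off of this design: the value `name` becomes reserved, so a dataset literally called "name" could not be queried through the same flag.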

@menshikh-iv

@aneesh-joshi thanks!

@menshikh-iv menshikh-iv merged commit 4520adf into piskvorky:develop Aug 3, 2018
@aneesh-joshi aneesh-joshi deleted the name_only_develop branch August 3, 2018 06:30