Skip to content

[WIP] Data/model storage. Fix 1453#1632

Closed
chaitaliSaini wants to merge 24 commits into
piskvorky:developfrom
chaitaliSaini:develop
Closed

[WIP] Data/model storage. Fix 1453#1632
chaitaliSaini wants to merge 24 commits into
piskvorky:developfrom
chaitaliSaini:develop

Conversation

@chaitaliSaini
Copy link
Copy Markdown
Contributor

@chaitaliSaini chaitaliSaini commented Oct 17, 2017

API for dataset/model storage (old PR #1492).

@menshikh-iv menshikh-iv changed the title [WIP]Data/model storage [WIP] Data/model storage Oct 17, 2017
@menshikh-iv menshikh-iv changed the title [WIP] Data/model storage [WIP] Data/model storage. Fix 1453 Oct 17, 2017
@menshikh-iv menshikh-iv added the incubator project PR is RaRe incubator project label Oct 17, 2017
Comment thread gensim/downloader.py Outdated


def _create_base_dir():
r"""Create the gensim-data directory in home directory, if it has not been already created.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really needed to add r for all docstrings ? What's a reason?

Comment thread gensim/downloader.py
sys.stdout.flush()


def _create_base_dir():
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use __ instead of _ will be better (for hiding from import), here and everywhere?

Comment thread gensim/downloader.py Outdated

if __name__ == '__main__':
logging.basicConfig(format='%(asctime)s :%(name)s :%(levelname)s :%(message)s', stream=sys.stdout, level=logging.INFO)
parser = argparse.ArgumentParser(description="Gensim console API", usage="python -m gensim.api.downloader [-h] [-d data__name | -i data__name | -c]")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to pass custom "usage" string here (argparse will generate it automatically)

Comment thread gensim/downloader.py Outdated
logging.basicConfig(format='%(asctime)s :%(name)s :%(levelname)s :%(message)s', stream=sys.stdout, level=logging.INFO)
parser = argparse.ArgumentParser(description="Gensim console API", usage="python -m gensim.api.downloader [-h] [-d data__name | -i data__name | -c]")
group = parser.add_mutually_exclusive_group()
group.add_argument("-d", "--download", metavar="data__name", nargs=1, help="To download a corpus/model : python -m gensim.downloader -d corpus/model name")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strange names for metavar, why metavar is needed here?

Comment thread gensim/downloader.py Outdated
logger.info("%s downloaded", name)
else:
rmtree(tmp_dir)
raise Exception("There was a problem in downloading the data. We recommend you to re-try.")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add info about checksums (concrete filename, expected checksum, real checksum, expected size, real size).

@menshikh-iv
Copy link
Copy Markdown
Contributor

Great job @chaitaliSaini, now your code is more readable and clear (and works stable) 🔥 👍 @anotherbugmaster will review your docstrings today.

@menshikh-iv menshikh-iv requested review from menshikh-iv and removed request for menshikh-iv October 26, 2017 13:05
Copy link
Copy Markdown
Contributor

@tefimov tefimov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job, thank you! Fix the minor issues and check out this styleguide (in case you haven't yet), it will help you write consistent documentation:

https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt#docstring-standard

Comment thread gensim/downloader.py Outdated


def progress(chunks_downloaded, chunk_size, total_size):
r"""Create and update the progress bar.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is r necessary?

Comment thread gensim/downloader.py Outdated
filled_len = int(math.floor((bar_len * size_downloaded) / total_size))
percent_downloaded = round((size_downloaded * 100) / total_size, 1)
bar = '=' * filled_len + '-' * (bar_len - filled_len)
sys.stdout.write('[%s] %s%s %s/%sMB downloaded\r' % (bar, percent_downloaded, "%", round(size_downloaded / (1024 * 1024), 1), round(float(total_size) / (1024 * 1024), 1)))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too long

Comment thread gensim/downloader.py Outdated


def _calculate_md5_checksum(tar_file):
r"""Calculate the checksum of the given tar.gz file.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r again :)

Comment thread gensim/downloader.py Outdated
def info(name=None):
r"""Return the information related to model/dataset.

If name is supplied, then information related to the given dataset/model will be returned. Otherwise detailed information of all model/datasets will be returned.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too long, split it

Comment thread gensim/downloader.py Outdated
Returns
-------
dict
Return detailed information about all models/datasets if name is not provided. Otherwise return detailed informtiona of the specific model/dataset
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too long

Comment thread gensim/downloader.py Outdated
data:
load model to memory
data_dir: str
return path of dataset/model.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No new line after last section

Comment thread gensim/downloader.py Outdated

Parameters
----------
name : {None, data name}, optional
Copy link
Copy Markdown
Contributor

@tefimov tefimov Oct 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name : str or None, optional is the right way. Also try to write a description after every parameter.

Comment thread gensim/downloader.py Outdated
Parameters
----------
name: str
dataset/model name
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Capital letters

Comment thread gensim/downloader.py Outdated
Parameters
----------
name: str
dataset/model name which has to be downloaded
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also capital letters

Comment thread gensim/test/test_api.py
import numpy as np


class TestApi(unittest.TestCase):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to add test for multipart

Comment thread gensim/downloader.py Outdated
import math
import shutil
import tempfile
try:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One try/catch is enough here.

Comment thread gensim/downloader.py Outdated
Parameters
----------
chunks_downloaded : int
Number of chunks of data that have been downloaded
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

. at the end of sentence (here and anywhere)

Comment thread gensim/downloader.py

def _create_base_dir():
"""Create the gensim-data directory in home directory, if it has not been already created.
Raises
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing newline before section title

Comment thread gensim/downloader.py Outdated
"""Create the gensim-data directory in home directory, if it has not been already created.
Raises
------
File Exists Error
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raises
---------
Exception
    Two possible reasons: ...

Comment thread gensim/downloader.py Outdated
return data['models'][name]["checksum"]
else:
if name in corpora:
return data['corpora'][name]["checksum-" + str(part)]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"cheksum-{}".format(part) instead

Comment thread gensim/downloader.py Outdated
tmp_dir = tempfile.mkdtemp()
tmp_load_file_path = os.path.join(tmp_dir, "__init__.py")
urllib.urlretrieve(url_load_file, tmp_load_file_path)
no_parts = int(_get_parts(name))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

store it as int, don't cast

Comment thread gensim/downloader.py Outdated
compressed_folder_name = "{f}.tar.gz_a{p}".format(f=name, p=chr(96 + part))
tmp_data_file_dir = os.path.join(tmp_dir, compressed_folder_name)
logger.info("Downloading Part %s/%s", part, no_parts)
urllib.urlretrieve(url_data, tmp_data_file_dir, reporthook=_progress)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Show part on progressbar

Comment thread gensim/downloader.py Outdated
concatenated_folder_dir = os.path.join(tmp_dir, concatenated_folder_name)
for part in range(1, no_parts + 1):
url_data = "https://github.com/chaitaliSaini/gensim-data/releases/download/{f}/{f}.tar.gz_a{p}".format(f=name, p=chr(96 + part))
compressed_folder_name = "{f}.tar.gz_a{p}".format(f=name, p=chr(96 + part))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use numeric suffixes

Comment thread gensim/downloader.py Outdated
os.remove(concatenated_folder_dir)
os.rename(tmp_dir, data_folder_dir)
else:
url_data = "https://github.com/chaitaliSaini/gensim-data/releases/download/{f}/{f}.tar.gz".format(f=name)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make distinct function

Comment thread gensim/downloader.py Outdated
logger.info("%s \n", json.dumps(data['corpora'][name], indent=4))
return data['corpora'][name]
elif name in models:
logger.info("%s \n", json.dumps(data['corpora'][name], indent=4))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug data['corpora'][name] -> data['models'][name]

@menshikh-iv
Copy link
Copy Markdown
Contributor

Finished in #1705

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

incubator project PR is RaRe incubator project

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants