[WIP] Data/model storage. Fix 1453 by chaitaliSaini · Pull Request #1632 · piskvorky/gensim

chaitaliSaini · 2017-10-17T08:54:32Z

API for dataset/model storage (old PR #1492).

menshikh-iv · 2017-10-26T12:51:50Z

+
+
+def _create_base_dir():
+    r"""Create the gensim-data directory in home directory, if it has not been already created.


Is it really necessary to add the r prefix to all docstrings? What's the reason?
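
For reference, the r prefix only changes a docstring that contains backslashes. A minimal sketch with two hypothetical functions (neither is from this PR):

```python
def raw_doc():
    r"""Write '\n' after each record."""


def plain_doc():
    """Write '\n' after each record."""


# In the raw docstring the backslash and the letter n stay as two characters.
raw_has_backslash = "\\n" in raw_doc.__doc__

# In the plain docstring the escape is interpreted as a real newline.
plain_has_newline = "\n" in plain_doc.__doc__
```

A docstring without backslashes, such as the one in this hunk, is identical with or without the prefix.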

menshikh-iv · 2017-10-26T12:54:47Z

+    sys.stdout.flush()
+
+
+def _create_base_dir():


Maybe using __ instead of _ would be better (for hiding from import), here and everywhere?
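
A quick way to check what a star-import actually exposes. This is only a sketch: the in-memory module name demo_mod and its three functions are made up for illustration.

```python
import sys
import types

# Build a throwaway module in memory (hypothetical name "demo_mod").
demo = types.ModuleType("demo_mod")
exec(
    "def public(): pass\n"
    "def _single(): pass\n"
    "def __double(): pass\n",
    demo.__dict__,
)
sys.modules["demo_mod"] = demo

# Star-import into a fresh namespace and list what arrived.
namespace = {}
exec("from demo_mod import *", namespace)
imported = sorted(key for key in namespace if key != "__builtins__")
print(imported)
```

Without an `__all__` list, a star-import skips every name that starts with an underscore, so a single leading underscore is already left out. Name mangling for double underscores applies only inside class bodies.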

menshikh-iv · 2017-10-26T12:57:47Z

+
+if __name__ == '__main__':
+    logging.basicConfig(format='%(asctime)s :%(name)s :%(levelname)s :%(message)s', stream=sys.stdout, level=logging.INFO)
+    parser = argparse.ArgumentParser(description="Gensim console API", usage="python -m gensim.api.downloader  [-h] [-d data__name | -i data__name | -c]")


No need to pass a custom "usage" string here (argparse will generate it automatically).
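
A minimal sketch of the generated usage line. The short flags come from the hunk; the prog value, the long option names --info and --catalogue, and the help texts are assumptions.

```python
import argparse

parser = argparse.ArgumentParser(prog="downloader", description="Gensim console API")
group = parser.add_mutually_exclusive_group()
group.add_argument("-d", "--download", help="Download a corpus or model by name")
group.add_argument("-i", "--info", help="Show information about a corpus or model")
group.add_argument("-c", "--catalogue", action="store_true", help="List everything available")

# argparse derives the usage line from the declared arguments.
usage = parser.format_usage()
print(usage)
```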

menshikh-iv · 2017-10-26T12:58:12Z

+    logging.basicConfig(format='%(asctime)s :%(name)s :%(levelname)s :%(message)s', stream=sys.stdout, level=logging.INFO)
+    parser = argparse.ArgumentParser(description="Gensim console API", usage="python -m gensim.api.downloader  [-h] [-d data__name | -i data__name | -c]")
+    group = parser.add_mutually_exclusive_group()
+    group.add_argument("-d", "--download", metavar="data__name", nargs=1, help="To download a corpus/model : python -m gensim.downloader -d corpus/model name")


Strange names for metavar; why is metavar needed here?
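
For comparison, a small sketch of what metavar changes. The value NAME and the dataset name "text8" are placeholders chosen for illustration.

```python
import argparse

parser = argparse.ArgumentParser(prog="downloader")
# No metavar: the placeholder defaults to the upper-cased destination name.
parser.add_argument("-d", "--download")
# Explicit metavar: only the help and usage text change, not the parsing.
parser.add_argument("-i", "--info", metavar="NAME")

usage = parser.format_usage()
args = parser.parse_args(["-d", "text8"])
```

Dropping nargs=1 also means the parsed value is a plain string instead of a one-element list.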

menshikh-iv · 2017-10-26T13:01:07Z

+        logger.info("%s downloaded", name)
+    else:
+        rmtree(tmp_dir)
+        raise Exception("There was a problem in downloading the data. We recommend you to re-try.")


Add info about checksums (concrete filename, expected checksum, real checksum, expected size, real size).
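
One possible shape for such an error, as a sketch. The function name verify_download, its parameters and the message wording are hypothetical.

```python
import hashlib
import os


def verify_download(path, expected_md5, expected_size):
    """Raise an exception with full details if the file does not match."""
    actual_size = os.path.getsize(path)
    md5 = hashlib.md5()
    with open(path, "rb") as fin:
        # Read in 1 MB blocks so large archives do not have to fit in memory.
        for block in iter(lambda: fin.read(1024 * 1024), b""):
            md5.update(block)
    actual_md5 = md5.hexdigest()
    if actual_md5 != expected_md5 or actual_size != expected_size:
        raise Exception(
            "Download of {} looks corrupted: expected checksum {}, got {}; "
            "expected size {} bytes, got {} bytes. Please retry.".format(
                path, expected_md5, actual_md5, expected_size, actual_size
            )
        )
```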

menshikh-iv · 2017-10-26T13:04:37Z

Great job @chaitaliSaini, now your code is more readable and clear (and works stably) 🔥 👍 @anotherbugmaster will review your docstrings today.

tefimov

Good job, thank you! Fix the minor issues and check out this styleguide (in case you haven't yet), it will help you write consistent documentation:

https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt#docstring-standard

tefimov · 2017-10-26T15:16:17Z

+
+
+def progress(chunks_downloaded, chunk_size, total_size):
+    r"""Create and update the progress bar.


Why is r necessary?

tefimov · 2017-10-26T15:20:04Z

+    filled_len = int(math.floor((bar_len * size_downloaded) / total_size))
+    percent_downloaded = round((size_downloaded * 100) / total_size, 1)
+    bar = '=' * filled_len + '-' * (bar_len - filled_len)
+    sys.stdout.write('[%s] %s%s %s/%sMB downloaded\r' % (bar, percent_downloaded, "%", round(size_downloaded / (1024 * 1024), 1), round(float(total_size) / (1024 * 1024), 1)))


tefimov · 2017-10-26T15:21:22Z

+
+
+def _calculate_md5_checksum(tar_file):
+    r"""Calculate the checksum of the given tar.gz file.


tefimov · 2017-10-26T15:33:05Z

+def info(name=None):
+    r"""Return the information related to model/dataset.
+
+    If name is supplied, then information related to the given dataset/model will be returned. Otherwise detailed information of all model/datasets will be returned.


Too long, split it

tefimov · 2017-10-26T15:33:19Z

+    Returns
+    -------
+    dict
+        Return detailed information about all models/datasets if name is not provided. Otherwise return detailed informtiona of the specific model/dataset


tefimov · 2017-10-26T15:36:21Z

+    data:
+        load model to memory
+    data_dir: str
+        return path of dataset/model.


No new line after last section

tefimov · 2017-10-26T15:41:07Z

+
+    Parameters
+    ----------
+    name : {None, data name}, optional


name : str or None, optional is the right way. Also try to write a description after every parameter.

tefimov · 2017-10-26T15:42:22Z

+    Parameters
+    ----------
+    name: str
+        dataset/model name


Start the description with a capital letter.

tefimov · 2017-10-26T15:42:37Z

+    Parameters
+    ----------
+    name: str
+        dataset/model name which has to be downloaded


Start this description with a capital letter, too.
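
Putting the docstring remarks from this review together (type written as "str or None, optional", capitalized descriptions that end with a period, a blank line before each section title), a numpy-style docstring could look roughly like this. The wording and the placeholder body are illustrative, not the final text of the PR.

```python
def info(name=None):
    """Return information about a dataset or model.

    Parameters
    ----------
    name : str or None, optional
        Name of the dataset or model. If None, information about
        everything available is returned.

    Returns
    -------
    dict
        Detailed information about one dataset/model, or about all of them.

    """
    # Placeholder body: this sketch is only about the docstring layout.
    raise NotImplementedError("docstring layout sketch only")
```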

menshikh-iv · 2017-11-07T12:00:22Z

+import numpy as np
+
+
+class TestApi(unittest.TestCase):


Need to add a test for multipart downloads.

menshikh-iv · 2017-11-07T12:02:19Z

+import math
+import shutil
+import tempfile
+try:


One try/except is enough here.
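
The hunk does not show what the try blocks import. Assuming they guard the Python 2/3 urllib differences, a single block could look like this sketch:

```python
try:
    # Python 3 layout.
    from urllib.request import urlopen, urlretrieve
except ImportError:
    # Python 2 layout: the two helpers live in different modules.
    from urllib import urlretrieve
    from urllib2 import urlopen
```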

menshikh-iv · 2017-11-07T12:10:30Z

+    Parameters
+    ----------
+    chunks_downloaded : int
+        Number of chunks of data that have been downloaded


Add a period at the end of the sentence (here and everywhere).

menshikh-iv · 2017-11-07T12:10:56Z

+
+def _create_base_dir():
+    """Create the gensim-data directory in home directory, if it has not been already created.
+    Raises


Missing blank line before the section title.

menshikh-iv · 2017-11-07T12:38:58Z

+    """Create the gensim-data directory in home directory, if it has not been already created.
+    Raises
+    ------
+    File Exists Error


Raises
------
Exception
    Two possible reasons: ...

menshikh-iv · 2017-11-07T12:41:57Z

+            return data['models'][name]["checksum"]
+    else:
+        if name in corpora:
+            return data['corpora'][name]["checksum-" + str(part)]


Use "checksum-{}".format(part) instead.

menshikh-iv · 2017-11-07T12:44:02Z

+    tmp_dir = tempfile.mkdtemp()
+    tmp_load_file_path = os.path.join(tmp_dir, "__init__.py")
+    urllib.urlretrieve(url_load_file, tmp_load_file_path)
+    no_parts = int(_get_parts(name))


Store it as an int; don't cast.

menshikh-iv · 2017-11-07T12:47:35Z

+            compressed_folder_name = "{f}.tar.gz_a{p}".format(f=name, p=chr(96 + part))
+            tmp_data_file_dir = os.path.join(tmp_dir, compressed_folder_name)
+            logger.info("Downloading Part %s/%s", part, no_parts)
+            urllib.urlretrieve(url_data, tmp_data_file_dir, reporthook=_progress)


Show the part number on the progress bar.
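
One way to do this is to bind the part number with functools.partial before handing the hook to urlretrieve. A sketch, where the extra part and total_parts parameters and the format_progress helper are hypothetical, and the server is assumed to report a positive total size:

```python
import functools
import sys


def format_progress(chunks_downloaded, chunk_size, total_size, part=1, total_parts=1):
    """Build one progress line; `part` and `total_parts` are hypothetical extras."""
    bar_len = 50
    # The last chunk may overshoot, so cap the downloaded size at the total.
    size_downloaded = min(chunks_downloaded * chunk_size, total_size)
    filled_len = bar_len * size_downloaded // total_size
    percent = round(size_downloaded * 100.0 / total_size, 1)
    bar = "=" * filled_len + "-" * (bar_len - filled_len)
    mb = 1024.0 * 1024.0
    return "Part {}/{} [{}] {}% {:.1f}/{:.1f}MB downloaded".format(
        part, total_parts, bar, percent, size_downloaded / mb, total_size / mb
    )


def _progress(chunks_downloaded, chunk_size, total_size, part=1, total_parts=1):
    """Report hook: rewrite the same console line on every call."""
    line = format_progress(chunks_downloaded, chunk_size, total_size, part, total_parts)
    sys.stdout.write(line + "\r")
    sys.stdout.flush()


# urlretrieve calls the hook with three positional arguments,
# so the part information is bound up front as keywords.
hook = functools.partial(_progress, part=2, total_parts=3)
```

The bound hook can then be passed as reporthook=hook in the download loop.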

menshikh-iv · 2017-11-07T12:54:53Z

+        concatenated_folder_dir = os.path.join(tmp_dir, concatenated_folder_name)
+        for part in range(1, no_parts + 1):
+            url_data = "https://github.com/chaitaliSaini/gensim-data/releases/download/{f}/{f}.tar.gz_a{p}".format(f=name, p=chr(96 + part))
+            compressed_folder_name = "{f}.tar.gz_a{p}".format(f=name, p=chr(96 + part))


Use numeric suffixes
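
chr(96 + part) only covers 26 parts before it leaves the alphabet. A sketch with zero-padded numeric suffixes; the exact naming scheme and the helper name part_names are assumptions, and the release assets would have to be named the same way.

```python
def part_names(name, n_parts):
    """Return archive part names with two-digit numeric suffixes (hypothetical scheme)."""
    return ["{f}.tar.gz_{p:02d}".format(f=name, p=part) for part in range(n_parts)]


names = part_names("demo", 3)
print(names)
```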

menshikh-iv · 2017-11-07T12:56:27Z

+        os.remove(concatenated_folder_dir)
+        os.rename(tmp_dir, data_folder_dir)
+    else:
+        url_data = "https://github.com/chaitaliSaini/gensim-data/releases/download/{f}/{f}.tar.gz".format(f=name)


Make this a distinct function.

menshikh-iv · 2017-11-08T16:22:16Z

+            logger.info("%s \n", json.dumps(data['corpora'][name], indent=4))
+            return data['corpora'][name]
+        elif name in models:
+            logger.info("%s \n", json.dumps(data['corpora'][name], indent=4))


Bug: data['corpora'][name] -> data['models'][name]
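
A sketch of the lookup with the duplicated branch folded into a loop, so the copy-paste slip cannot recur. Passing data as an argument and the error message are assumptions made to keep the example self-contained.

```python
import json
import logging

logger = logging.getLogger(__name__)


def info(data, name=None):
    """Return the entry for `name`, or everything if `name` is None."""
    if name is None:
        return data
    # Check both sections with the same code path.
    for section in ("corpora", "models"):
        if name in data[section]:
            logger.info("%s \n", json.dumps(data[section][name], indent=4))
            return data[section][name]
    raise Exception("Incorrect model/corpus name: {}".format(name))
```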

menshikh-iv · 2017-11-10T04:35:03Z

Finished in #1705

chaitaliSaini added 16 commits July 30, 2017 04:59

added download and catalogue functions

ec8c016

added link and info

636bfff

modeified link and info functions

fffe203

Updated download function

f567dee

Added logging

61ba3d6

Added load function

d8257a3

Removed unused imports

5571469

added check for installed models

cabf173

updated download function

5d509fc

Improved help for terminal

551f54e

load returns model path

ff5509f

added jupyter notebook and merged code

e654070

alternate names for load

b0d1110

corrected formatting

498b32b

added checksum after download

03649b0

refactored code

7fbf228

menshikh-iv changed the title ~~[WIP]Data/model storage~~ [WIP] Data/model storage Oct 17, 2017

menshikh-iv changed the title ~~[WIP] Data/model storage~~ [WIP] Data/model storage. Fix 1453 Oct 17, 2017

menshikh-iv added the incubator project label Oct 17, 2017

chaitaliSaini added 3 commits October 17, 2017 15:16

removed log file code

d0311d1

added progressbar

7e00e2d

fixed pep8

f38670d

menshikh-iv suggested changes Oct 26, 2017

menshikh-iv requested review from menshikh-iv and removed request for menshikh-iv October 26, 2017 13:05

tefimov suggested changes Oct 26, 2017

chaitaliSaini added 2 commits October 31, 2017 12:59

added tests

4cadfa2

added download for >2gb data

e844e01

menshikh-iv suggested changes Nov 7, 2017

chaitaliSaini added 2 commits November 8, 2017 17:45

add test for multipart

580a93a

fixed pep8

e899f88

menshikh-iv suggested changes Nov 8, 2017

fixed bug

8eeec54

menshikh-iv closed this Nov 10, 2017


