Issues with TarFS - slow to open large tar archives, and exception #275

@davidparks21

I'm trying to open a 500 GB tar archive with PyFilesystem's TarFS. Opening it takes about 45 minutes because the entire file has to be read first. That open time is the first issue I'm having. With the standard tarfile package I can open the same archive and stream files from it using tarfile.open and mytar.next() with no delay. The full read appears to happen at tarfs.py:275:

    self._directory = OrderedDict((relpath(self._decode(info.name)).rstrip("/"), info) for info in self._tar)

It seems this could be lazily initialized. Probably even better, TarFS could query the tarfile package directly on demand rather than maintaining a full dictionary of members in PyFilesystem.
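
For comparison, here is a minimal sketch of the streaming access I mean, using only the standard library. It is not how TarFS works internally; iter_member_names is a hypothetical helper name used for illustration. TarFile.next() reads one member header per call, so the first names come back without scanning the rest of the archive.

```python
import tarfile


def iter_member_names(tar_path):
    """Yield member names one at a time, reading headers lazily.

    Hypothetical helper for illustration; not part of PyFilesystem.
    """
    with tarfile.open(tar_path) as tar:
        member = tar.next()  # reads only the next header
        while member is not None:
            yield member.name
            member = tar.next()
```

Calling next(iter_member_names(path)) returns the first member name right away, and the archive is only read further as the generator is consumed.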

Once the archive does open, walking the files with mytarfs.walk.files() raises this exception:

Traceback (most recent call last):
  File "/home/davidparks21/opt/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2862, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-8-5f315c5de15b>", line 1, in <module>
    next(g)
  File "/home/davidparks21/opt/anaconda3/lib/python3.6/site-packages/fs/walk.py", line 362, in files
    for _path, info in self._iter_walk(fs, path=path):
  File "/home/davidparks21/opt/anaconda3/lib/python3.6/site-packages/fs/walk.py", line 433, in _walk_breadth
    for info in _scan(fs, dir_path, namespaces=namespaces):
  File "/home/davidparks21/opt/anaconda3/lib/python3.6/site-packages/fs/walk.py", line 294, in _scan
    for info in fs.scandir(dir_path, namespaces=namespaces):
  File "/home/davidparks21/opt/anaconda3/lib/python3.6/site-packages/fs/base.py", line 1256, in scandir
    for name in self.listdir(path)
  File "/home/davidparks21/opt/anaconda3/lib/python3.6/site-packages/fs/tarfs.py", line 389, in listdir
    return list(OrderedDict.fromkeys(content))
  File "/home/davidparks21/opt/anaconda3/lib/python3.6/site-packages/fs/tarfs.py", line 388, in <genexpr>
    content = (parts(child)[1] for child in children if relpath(child))
IndexError: list index out of range

It appears that the assumption that parts(child) returns at least 2 elements does not always hold. I tried to reproduce the issue with a small tar file but could not; it only occurs with my large 500 GB archive. Unfortunately, the time it takes to open that archive is hindering my efforts to debug the issue, so for now I can only report it.
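
One way to narrow this down without going through TarFS would be to stream the member names with tarfile and flag any that reduce to an empty relative path, such as a "./" root entry. That such names are what triggers the IndexError is only a guess on my part, not something I have confirmed. find_suspect_names is a hypothetical helper, and the normalization below is a stdlib approximation, not PyFilesystem's own path handling.

```python
import posixpath
import tarfile


def find_suspect_names(tar_path):
    """Return member names that normalize to an empty relative path.

    Hypothetical diagnostic; the normalization only approximates
    what a relative-path cleanup might do.
    """
    suspects = []
    with tarfile.open(tar_path) as tar:
        for member in tar:  # iterating reads headers sequentially
            cleaned = posixpath.normpath(member.name).strip("/")
            if cleaned in ("", "."):
                suspects.append(member.name)
    return suspects
```

This still has to walk every header in the archive, but it skips building the TarFS directory and reports the raw names as stored.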
