Description
I'm trying to open a 500 GB tar archive in PyFilesystem. Opening it takes about 45 minutes because the entire file has to be read first, and that delay is the first issue. With the standard tarfile package I can open the same archive and stream files using tarfile.open and mytar.next() with no delay. The full read appears to come from tarfs.py:275: self._directory = OrderedDict((relpath(self._decode(info.name)).rstrip("/"), info) for info in self._tar). It seems this could be lazily initialized, or, probably better, the tarfile package could be queried directly on demand instead of PyFilesystem maintaining a full dictionary of objects.
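For comparison, here is a minimal sketch of the streaming access pattern described above, using only the standard tarfile module. The tiny in-memory archive, the helper names build_sample_tar and iter_members, and the choice of stream mode "r|" are assumptions made for illustration; they are not taken from PyFilesystem or from my actual archive.

```python
import io
import tarfile


def build_sample_tar() -> bytes:
    """Build a tiny in-memory tar archive standing in for the large file."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in [("a.txt", b"alpha"), ("dir/b.txt", b"beta")]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def iter_members(fileobj):
    """Yield (name, data) one member at a time, without indexing the archive first."""
    # "r|" reads the archive as a forward-only stream, so each header is
    # parsed only when the iteration reaches it.
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member).read()


stream = iter_members(io.BytesIO(build_sample_tar()))
first = next(stream)   # available after reading only the first member
rest = list(stream)    # the remaining members are read on demand
```

The first member is returned as soon as its header and data have been read, which is the behaviour I would hope a lazily initialized TarFS could offer.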
Once the archive is open, walking the files with mytarfs.walk.files() raises this exception:
Traceback (most recent call last):
  File "/home/davidparks21/opt/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2862, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-8-5f315c5de15b>", line 1, in <module>
    next(g)
  File "/home/davidparks21/opt/anaconda3/lib/python3.6/site-packages/fs/walk.py", line 362, in files
    for _path, info in self._iter_walk(fs, path=path):
  File "/home/davidparks21/opt/anaconda3/lib/python3.6/site-packages/fs/walk.py", line 433, in _walk_breadth
    for info in _scan(fs, dir_path, namespaces=namespaces):
  File "/home/davidparks21/opt/anaconda3/lib/python3.6/site-packages/fs/walk.py", line 294, in _scan
    for info in fs.scandir(dir_path, namespaces=namespaces):
  File "/home/davidparks21/opt/anaconda3/lib/python3.6/site-packages/fs/base.py", line 1256, in scandir
    for name in self.listdir(path)
  File "/home/davidparks21/opt/anaconda3/lib/python3.6/site-packages/fs/tarfs.py", line 389, in listdir
    return list(OrderedDict.fromkeys(content))
  File "/home/davidparks21/opt/anaconda3/lib/python3.6/site-packages/fs/tarfs.py", line 388, in <genexpr>
    content = (parts(child)[1] for child in children if relpath(child))
IndexError: list index out of range
It appears the assumption that parts(child) always returns 2 elements is not always correct. I tried to reproduce the issue with a small tar file but could not; it only occurs on my large 500 GB tar file. Unfortunately, the time it takes to open the archive is hindering efforts to debug the issue, so for now I can only report it.
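One way to narrow this down without paying the 45-minute open cost might be to stream the member names with tarfile and flag any that contain no real path component. The sketch below does that on a small in-memory archive. Note that the idea that entries such as "." or "./" are what trips parts(child)[1] is only a guess on my part, not something I have confirmed, and suspicious_names is a hypothetical helper, not part of PyFilesystem.

```python
import io
import tarfile


def suspicious_names(fileobj):
    """Return member names with no real path component (e.g. "." or "./")."""
    flagged = []
    # Stream mode avoids building an index of the whole archive up front.
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        for member in tar:
            # Drop empty segments and "." segments; anything left is a real component.
            components = [p for p in member.name.split("/") if p not in ("", ".")]
            if not components:
                flagged.append(member.name)
    return flagged


# Demo archive (an assumption for illustration): a bare "./" directory entry
# followed by one regular file.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    root = tarfile.TarInfo("./")
    root.type = tarfile.DIRTYPE
    tar.addfile(root)
    info = tarfile.TarInfo("./data/x.bin")
    info.size = 3
    tar.addfile(info, io.BytesIO(b"xyz"))
buf.seek(0)

flagged = suspicious_names(buf)
print(flagged)
```

If the large archive turns out to contain such entries, a small tar built the same way might be enough to reproduce the exception.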