Problem with the cache in GDriveFileSystem.find #229

@simone-viozzi

While working on #222, I discovered that find has a bug with the cache.

Let's assume self.path = root/tmp/ and a folder structure like:

root/tmp/
        └── fo1/
           ├── file2.pdf
           └── fo2/
              ├── file3.pdf
              └── fo3/
                 └── file4.pdf

Now let's run some tests:

f = fs.find('root/tmp/fo1/')
print(f)
> ['root/tmp/fo1/file2.pdf', 'root/tmp/fo1/fo2/file3.pdf', 'root/tmp/fo1/fo2/fo3/file4.pdf']

f = fs.find('root/tmp/fo1/fo2')
print(f)
> ['root/tmp/fo1/fo2/fo3/file4.pdf', 'root/tmp/fo1/fo2/file3.pdf']

Both results are correct.
But if we call only find('root/tmp/fo1/fo2'), without the earlier find('root/tmp/fo1/') call to populate the cache:

f = fs.find('root/tmp/fo1/fo2')
print(f)
> []

This happens because find relies on the cache, and at the start the cache is only populated with ids from one level below self.path.

So, in the last example, the content of the cache is:

{
'1IETDYYj23PgGaInZofa9MyANyBlOoiyh': 'tmp', 
'1k6u2-FStB6rOlq6hmDXlRl2aLES1l6vp': 'tmp/fo1', 
}

I think that, because the cache has no entry for tmp/fo1/fo2 (the starting path of find), query_ids stays empty and the method returns an empty list.
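To make the suspected cause concrete, here is a minimal standalone sketch of the filter, run against the cache contents shown above. The helper name select_query_ids is made up for illustration, and it assumes split_path() yields a base relative to the root (tmp/fo1/fo2), in the same form as the cache values.

```python
import posixpath

# Cache contents copied from the report (folder id -> path relative to the root).
ids_cache = {
    "1IETDYYj23PgGaInZofa9MyANyBlOoiyh": "tmp",
    "1k6u2-FStB6rOlq6hmDXlRl2aLES1l6vp": "tmp/fo1",
}


def select_query_ids(base, cache, seen=frozenset()):
    """Hypothetical standalone copy of the dict comprehension used in find()."""
    return {
        dir_id: dir_name
        for dir_id, dir_name in cache.items()
        # Keep only cached folders that are `base` itself or lie below it.
        if posixpath.commonpath([base, dir_name]) == base
        if dir_id not in seen
    }


# Assumption: `base` is relative to the root, like the cache values.
print(select_query_ids("tmp/fo1", ids_cache))      # the cached tmp/fo1 entry matches
print(select_query_ids("tmp/fo1/fo2", ids_cache))  # no cached entry matches
```

With base set to tmp/fo1/fo2, neither cached path has base as its common path, so the selection is empty and the loop has nothing to query.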

The lines of code involved are:

def find(self, path, detail=False, **kwargs):
    bucket, base = self.split_path(path)
    seen_paths = set()
    dir_ids = [self._ids_cache["ids"].copy()]
    contents = []
    while dir_ids:
        query_ids = {
            dir_id: dir_name
            for dir_id, dir_name in dir_ids.pop().items()
            if posixpath.commonpath([base, dir_name]) == base
            if dir_id not in seen_paths
        }
        if not query_ids:
            continue
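One possible direction, only as a rough sketch and not the project's actual fix: before filtering, make sure the starting path and any missing ancestors are in the cache. Everything below is hypothetical. FAKE_DRIVE and list_subfolders stand in for a real Drive listing call, ensure_cached is an invented helper, and the ids are placeholders.

```python
import posixpath

# Hypothetical in-memory stand-in for Drive: parent folder id -> {child name: child id}.
FAKE_DRIVE = {
    "id-tmp": {"fo1": "id-fo1"},
    "id-fo1": {"fo2": "id-fo2"},
    "id-fo2": {"fo3": "id-fo3"},
}


def list_subfolders(folder_id):
    """Hypothetical helper standing in for a Drive API folder listing."""
    return FAKE_DRIVE.get(folder_id, {})


def ensure_cached(base, cache):
    """Add ids for `base` and its missing ancestors to `cache` (id -> path).

    Returns False if some component of `base` cannot be resolved.
    """
    path_to_id = {path: dir_id for dir_id, path in cache.items()}
    parts = base.strip("/").split("/")
    for depth in range(1, len(parts) + 1):
        current = "/".join(parts[:depth])
        if current in path_to_id:
            continue  # already cached, nothing to resolve
        parent_id = path_to_id.get("/".join(parts[:depth - 1]))
        if parent_id is None:
            return False  # no cached ancestor to start from
        child_id = list_subfolders(parent_id).get(parts[depth - 1])
        if child_id is None:
            return False  # the folder does not exist
        cache[child_id] = current
        path_to_id[current] = child_id
    return True


base = "tmp/fo1/fo2"
cache = {"id-tmp": "tmp", "id-fo1": "tmp/fo1"}
resolved = ensure_cached(base, cache)
query_ids = {
    dir_id: dir_name
    for dir_id, dir_name in cache.items()
    if posixpath.commonpath([base, dir_name]) == base
}
print(resolved, query_ids)
```

In this toy setup the starting folder ends up in the cache, so the filter has something to query. Whether this fits the real cache structure would need checking against the actual code.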

Labels: bug (Something isn't working), fs (fsspec implementation), priority-p1