Conversation
felixcremer
left a comment
I would really like to have something like this, but why are we restricting it to HTTPStore? Couldn't we make the remote and the cache both be AbstractStore? Then we could also cache data that comes from S3, or even data from a slower hard disk, into a hot scratch folder on an HPC system.

Yeah, that's just my use case; we can expand it. What needs to change? Can you comment in the code? I don't have my head across what all the stores are.

Why not just CachedDiskArray? I guess I'm not sure why you want the store to be caching rather than the array.

I think the main difference is that CachedDiskArray caches into RAM or into temporary files, while this CachingStore implementation caches into a different local store, which would then be persistent between sessions or computer restarts.

Yeah, maybe a different name, but it's caching to disk that we need. Stored?

LocallyCached maybe?
```julia
struct CachingHTTPStore <: AbstractStore
    remote::HTTPStore
    cache::DirectoryStore
end
```

Could we make this parameterized by the type of the remote and cache stores?
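The parameterization the comment asks about might look like the following. This is a self-contained sketch: the stand-in types below only mirror Zarr.jl's `AbstractStore`, `HTTPStore`, and `DirectoryStore` so the example runs on its own.

```julia
# Stand-ins mirroring Zarr.jl's types, so this sketch is self-contained.
abstract type AbstractStore end

struct HTTPStore <: AbstractStore
    url::String
end

struct DirectoryStore <: AbstractStore
    path::String
end

# Any AbstractStore can act as the remote or the cache, so e.g. an S3 store
# could be cached into a fast scratch DirectoryStore on an HPC node.
struct CachingStore{R<:AbstractStore,C<:AbstractStore} <: AbstractStore
    remote::R
    cache::C
end

store = CachingStore(HTTPStore("https://example.com/data.zarr"),
                     DirectoryStore("/tmp/zarr-cache"))
```

With the type parameters, dispatch can later specialize on particular remote/cache combinations without changing the struct.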
Suggested change:

```diff
-function CachingHTTPStore(url::AbstractString, cache_path::AbstractString)
+function CachingStore(url::AbstractString, cache_path::AbstractString)
```
```julia
function CachingHTTPStore(url::AbstractString, cache_path::AbstractString)
    CachingHTTPStore(HTTPStore(url), DirectoryStore(cache_path))
end
```

Suggested change:

```diff
-    CachingHTTPStore(HTTPStore(url), DirectoryStore(cache_path))
+    CachingHTTPStore(storefromstring(url), storefromstring(cache_path))
```
```julia
# Accept any object with url and cache fields (e.g., RasterDataSources.CachedCloudSource)
CachingHTTPStore(source) = CachingHTTPStore(source.url, source.cache)
```

Is this something that should rather live in the RasterDataSources extension on Zarr? It seems surprising to have it in Zarr itself.
For the naming, I think StoredStore is a bit confusing. LocalCache maybe, or Hoard, but that might be a bit too obscure. I am wondering whether the local cache would be usable as a zarr in its own right?
I just discussed this with @meggart, and he would rather have this live as CachedDiskArray. The idea is to open up the CachedDiskArray interface to allow any cache type, including a Zarr store or Zarr array as the cache. That enables more flexible use, for example to also cache reprojected data or computation results. We will have a new chunkfromcache function, and then this could dispatch on the cache type.
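The `chunkfromcache` idea above could be sketched as one generic entry point that dispatches on the cache type. The cache types here are hypothetical stand-ins (in the real design one of them would be a Zarr store or ZArray), and the method bodies are illustrative, not Zarr.jl API.

```julia
# Hypothetical cache types; the second stands in for a Zarr-backed cache.
struct MemoryCache
    chunks::Dict{String,Vector{UInt8}}
end

struct FileCache
    dir::String
end

# In-RAM cache: a plain dictionary lookup, `nothing` on a miss.
chunkfromcache(c::MemoryCache, key::String) = get(c.chunks, key, nothing)

# On-disk cache: read the chunk file if it exists.
function chunkfromcache(c::FileCache, key::String)
    path = joinpath(c.dir, key)
    isfile(path) ? read(path) : nothing
end

mem = MemoryCache(Dict("0.0" => UInt8[1, 2, 3]))
chunkfromcache(mem, "0.0")   # returns the cached bytes
chunkfromcache(mem, "0.1")   # returns nothing: not cached
```

A new cache backend then only needs its own `chunkfromcache` method, with no changes to the wrapping array type.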
So what @felixcremer and I discussed today is that it's better if the actual cached disk array is backed by a ZArray, but we can have a constructor method for CachedDiskArray that takes in a store and a path within it. Would that satisfy your use case @rafaqz? Then you can do group structures etc., but the actual path would have to be handled by the application that constructs the cached disk array, like RasterDataSources or Rasters.
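As a hedged sketch of that constructor idea, here every name is hypothetical (the real CachedDiskArray lives elsewhere and has its own API); the point is only that the caller supplies both the store and the path within it.

```julia
# Hypothetical wrapper: a cached array built from a slow parent array plus a
# store and a path inside that store. Illustrative names only.
struct SketchCachedDiskArray{A,S}
    data::A        # the slow parent array
    store::S       # store holding the cache
    path::String   # path within the store, chosen by the application
end

# Constructor taking a store and a path within it; the application
# (e.g. RasterDataSources or Rasters) is responsible for the path, so
# group structures inside the store work.
cached(data, store, path::String) = SketchCachedDiskArray(data, store, path)

arr = cached(rand(4, 4), Dict{String,Any}(), "group/era5/t2m")
```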
This came up when adding online ERA5 Zarr sources to RasterDataSources.jl. Usually in RasterDataSources the first use of a dataset takes a download hit, but after that the data is stored locally and subsequent loads are much faster. Using the existing HTTPStore breaks that pattern: it's always somewhat slow to load a lot of data.
This PR tries to get the best of both in Zarr: look for local chunks first, and download them via the HTTPStore if they aren't available locally.
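The cache-first read path described above can be sketched as follows, with plain Dicts standing in for the local and remote stores; `readchunk` is a hypothetical helper, not Zarr.jl API.

```julia
# Cache-first chunk read: try the local store, fall back to the remote on a
# miss, and persist the downloaded chunk so the next session loads fast.
function readchunk(cache::Dict, remote::Dict, key::String)
    haskey(cache, key) && return cache[key]   # local hit: no download needed
    haskey(remote, key) || return nothing     # chunk missing remotely too
    chunk = remote[key]                       # "download" from the remote
    cache[key] = chunk                        # persist for subsequent reads
    return chunk
end

remote = Dict("0.0" => UInt8[1, 2, 3])
cache = Dict{String,Vector{UInt8}}()
readchunk(cache, remote, "0.0")   # first read: fetched from remote, cached
readchunk(cache, remote, "0.0")   # second read: served from the local cache
```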
I added a trait `has_configurable_missing_chunks`, so this is easy to add to ZarrDatasets.jl - it currently special-cases a few types for what is essentially this trait, and CachingHTTPStore would need it too.
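Such a trait is typically a function with a conservative `false` default plus per-store opt-in methods. A minimal sketch, assuming stand-in store types (which stores actually opt in is illustrative, not taken from the PR):

```julia
# Self-contained stand-ins for the store hierarchy.
abstract type AbstractStore end
struct PlainStore <: AbstractStore end
struct CachingHTTPStore <: AbstractStore end

# Trait: default to false, let specific stores opt in.
has_configurable_missing_chunks(::AbstractStore) = false
has_configurable_missing_chunks(::CachingHTTPStore) = true

has_configurable_missing_chunks(PlainStore())        # false
has_configurable_missing_chunks(CachingHTTPStore())  # true
```

Downstream code like ZarrDatasets.jl can then query the trait instead of special-casing concrete store types.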