add CachingHTTPStore #231

Open
rafaqz wants to merge 3 commits into JuliaIO:master from rafaqz:chaching_http_store

rafaqz commented Feb 24, 2026

This came up while adding online ERA5 Zarr sources to RasterDataSources.jl. Usually in RasterDataSources, the first use of a dataset has a download hit, but after that the data is stored locally and loading is much faster. Using the existing HTTPStore breaks that pattern: it's always kinda slow to load a lot of data.

This PR is trying to get the best of both in Zarr: look for local chunks first, and download them through HTTPStore if they aren't available.
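A minimal sketch of that lookup order, using only Julia's standard library. `cached_chunk`, `fake_fetch`, and the in-memory `remote` dictionary are hypothetical stand-ins for illustration; this is not the PR's code and not Zarr.jl's API.

```julia
# Hypothetical helper (not part of Zarr.jl): return the bytes for `key`,
# reading from `cache_dir` when the file exists and calling `fetch_remote` otherwise.
function cached_chunk(fetch_remote, cache_dir::AbstractString, key::AbstractString)
    path = joinpath(cache_dir, key)
    isfile(path) && return read(path)     # cache hit: no remote access
    data = fetch_remote(key)              # cache miss: ask the remote
    data === nothing && return nothing    # missing chunk: nothing to cache
    mkpath(dirname(path))
    write(path, data)                     # persist for later sessions
    return data
end

# Stand-in for a remote store; counts how often it is hit.
remote = Dict("temperature/0.0" => UInt8[0x01, 0x02, 0x03])
remote_hits = Ref(0)
function fake_fetch(key)
    remote_hits[] += 1
    return get(remote, key, nothing)
end

cache_dir = mktempdir()
first_read = cached_chunk(fake_fetch, cache_dir, "temperature/0.0")   # goes to the remote
second_read = cached_chunk(fake_fetch, cache_dir, "temperature/0.0")  # served from disk
```

Because the cache is just a directory, a later session pointed at the same `cache_dir` would skip the remote for every chunk already written there.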

I added a trait has_configurable_missing_chunks so this is easy to add to ZarrDatasets.jl - it currently special cases a few types for what is essentially this trait, and CachingHTTPStore would need it too.
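For illustration, a trait like this is usually a plain function with a conservative default that individual store types override. The sketch below assumes that shape; the `Toy*` types are hypothetical, and which real stores should return `true` is not stated here, so the choices below are arbitrary.

```julia
# Toy hierarchy standing in for Zarr's store types (names are hypothetical).
abstract type ToyAbstractStore end
struct ToyDirectoryStore <: ToyAbstractStore end
struct ToyHTTPStore <: ToyAbstractStore end
struct ToyCachingStore <: ToyAbstractStore end

# Default: a store does not opt in.
has_configurable_missing_chunks(::ToyAbstractStore) = false

# Opt-in per type (arbitrary picks for this sketch).
has_configurable_missing_chunks(::ToyHTTPStore) = true
has_configurable_missing_chunks(::ToyCachingStore) = true

# Downstream code can branch on the trait instead of listing concrete types.
describe(s::ToyAbstractStore) =
    has_configurable_missing_chunks(s) ? "configurable" : "fixed"
```

A new store type then only needs one extra method to participate, with no change to the downstream package.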

Contributor

felixcremer left a comment

I'd really like to have something like this, but why are we restricting it to HTTPStore? Couldn't we make the remote and the cache any AbstractStore? Then we could also cache data that comes from S3, or even data that comes from a slower hard disk, into a hot scratch folder on an HPC system.
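A rough sketch of what that generalization could look like, assuming stores are indexed by string keys and return `nothing` for a missing chunk. `ToyStore`, `DictStore`, and this `CachingStore` are hypothetical and self-contained; they do not use Zarr.jl's actual types.

```julia
abstract type ToyStore end

# In-memory store used for both roles in this sketch.
struct DictStore <: ToyStore
    data::Dict{String,Vector{UInt8}}
end
DictStore() = DictStore(Dict{String,Vector{UInt8}}())

Base.getindex(s::DictStore, key::String) = get(s.data, key, nothing)
Base.setindex!(s::DictStore, v, key::String) = (s.data[key] = v)

# Parameterized on both store types, so any remote/cache pairing works.
struct CachingStore{R<:ToyStore,C<:ToyStore} <: ToyStore
    remote::R
    cache::C
end

function Base.getindex(s::CachingStore, key::String)
    hit = s.cache[key]
    hit === nothing || return hit              # found locally
    data = s.remote[key]
    data === nothing || (s.cache[key] = data)  # fill the cache on a miss
    return data
end

remote = DictStore(Dict("a/0" => UInt8[0x2a]))
store = CachingStore(remote, DictStore())
chunk = store["a/0"]          # fetched from `remote`, copied into the cache
delete!(remote.data, "a/0")   # simulate the remote becoming unavailable
again = store["a/0"]          # still served, now from the cache
```

With this shape, an S3-backed remote or a slow-disk remote would only need to subtype the store abstraction and implement `getindex`; the cache side additionally needs `setindex!`.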

@rafaqz
Author

rafaqz commented Feb 24, 2026

Yeah, that's just my use case; we can expand it. What needs to change? Can you comment in the code? I don't have my head across what all the stores are.

@asinghvi17
Member

Why not just CachedDiskArray? I guess I'm not sure why you want the store to be caching rather than the array.

@felixcremer
Contributor

I think the main difference is that CachedDiskArray caches into RAM or into temporary files, while this implementation of the CachingStore caches into a different local store, which would then be persistent between sessions or computer restarts.

@rafaqz
Author

rafaqz commented Feb 26, 2026

Yeah, maybe a different name, but it's caching to disk that we need.

Stored?

@asinghvi17
Member

LocallyCached maybe?

```
"""
struct CachingHTTPStore <: AbstractStore
    remote::HTTPStore
```

Contributor

Could we make this parameterized by the type of the remote and cache stores?

```
    cache::DirectoryStore
end

function CachingHTTPStore(url::AbstractString, cache_path::AbstractString)
```

Contributor

Suggested change

```
-function CachingHTTPStore(url::AbstractString, cache_path::AbstractString)
+function CachingStore(url::AbstractString, cache_path::AbstractString)
```

```
end

function CachingHTTPStore(url::AbstractString, cache_path::AbstractString)
    CachingHTTPStore(HTTPStore(url), DirectoryStore(cache_path))
```

Contributor

Suggested change

```
-    CachingHTTPStore(HTTPStore(url), DirectoryStore(cache_path))
+    CachingHTTPStore(storefromstring(url), storefromstring(cache_path))
```

```
end

# Accept any object with url and cache fields (e.g., RasterDataSources.CachedCloudSource)
CachingHTTPStore(source) = CachingHTTPStore(source.url, source.cache)
```

Contributor

Is this something that should rather live in the RasterDataSources extension on Zarr?
This seems to be surprising in Zarr.

@felixcremer
Contributor

For the naming, I think StoredStore is a bit confusing. LocalCache maybe or Hoard but that might be a bit too obscure.

I am wondering whether the local cache would be usable as a zarr in its own right?
Also should we add the path to the remote to the metadata of the cache?

@felixcremer
Contributor

I just discussed this with @meggart and he would rather have this live in CachedDiskArray. The idea is to open up the CachedDiskArray interface to allow any cache type, including a Zarr store or Zarr array as the cache. This enables more flexible use, such as also caching reprojected data or computation results.

We will have a new chunkfromcache function and then this could dispatch on the cache type.

@asinghvi17
Member

asinghvi17 commented Mar 5, 2026

So what @felixcremer and I discussed today is that it's better if the actual cached diskarray is backed by a ZArray, but we can have a constructor method for CachedDiskArray that takes in a store and a path within it. Would that satisfy your use case, @rafaqz? Then you can do group structures etc., but the actual path would have to be handled by the application that constructs the cached disk array, like RasterDataSources or Rasters.
