Cloud-native geospatial I/O in R: historical context and the opportunity ahead
A discussion note — March 2026
Where we came from
For most of the history of scientific computing with spatial data, GDAL has been
the indispensable layer. It is not merely a library — it is the accumulated
product of decades of reverse-engineering, edge-case handling, and format
archaeology across hundreds of raster and vector formats. HDF-EOS, GRIB, obscure
military formats, every NetCDF dialect anyone ever deployed — the knowledge
embedded in GDAL drivers is irreplaceable and was not easily won. The R spatial
ecosystem (now most directly through gdalraster) is explicit about this dependency.
In Python the picture got muddier. Rasterio presented itself as "GDAL with a
better API," and the framing stuck. For a generation of Python geospatial
practitioners, GDAL became invisible infrastructure — something conda sorted out —
and rasterio became the thing they actually understood. The osgeo.gdal/ogr/osr
community, building directly on the C API through conda rather than pypi, retained
a clearer model of what GDAL actually is. But the two communities diverged, and
GDAL's value got attributed to its wrapper while its friction got attributed to
itself.
This matters because it shaped what people believed it was possible to go around.
What changed: cloud-native formats and async I/O
The Cloud-Optimised GeoTIFF (COG) was designed as a GDAL-compatible format. The
community that built it did not fully reckon with an implication: a sufficiently
correct TIFF reader makes GDAL optional for this format. COG's defining property —
the tile index in the header, tile data at known byte offsets — means that if you
can read the header and issue HTTP range requests, you have everything you need.
No driver architecture required. No VSI layer. No linking against a C library.
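That claim is concrete enough to demonstrate. The sketch below fetches just the first 16 KiB of a remote GeoTIFF with an HTTP range request, which for a typical COG is enough to hold the header and tile index. It assumes the curl R package; the URL is a placeholder, and the byte count is an illustrative choice, not a COG rule.

```
# Minimal sketch: one HTTP range request for the start of a COG.
# Assumes the curl package; the URL is a placeholder.
library(curl)

url <- "https://example.com/data/some.tif"        # placeholder COG URL
h <- new_handle()
handle_setheaders(h, Range = "bytes=0-16383")     # first 16 KiB only
res <- curl_fetch_memory(url, handle = h)

header <- res$content                # raw vector: TIFF header + IFDs
rawToChar(header[1:2])               # "II" (little-endian) or "MM" per the TIFF spec
```

Everything after this point, parsing the IFDs and issuing further range requests for tile bytes, is bookkeeping over offsets, which is exactly what async-tiff does in Rust.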
The Rust ecosystem then provided two pieces that make this practical:
- object_store (Apache arrow-rs) — a unified async I/O layer for S3, GCS,
  Azure, HTTP, and local filesystems, used by Delta Lake, DataFusion, and the
  broader Arrow ecosystem. Fast, correct, zero geospatial assumptions.
- async-tiff — a TIFF metadata and tile fetch/decode library built on
  object_store. It reads the IFD structure, exposes the overview pyramid, and
  fetches and decodes tiles. It knows about COG layout, compression codecs, and
  data types. It does not know about GDAL.
Kyle Barron (Development Seed) recognised this and built obstore — direct
Python bindings to object_store, the same storage layer rasterio uses internally
but exposed directly. The point is not to replace rasterio but to be honest about
which layer is doing what.
The R opportunity
Two new R packages sit on this foundation:
rustycogs wraps async-tiff. It exposes:
- tiff_ifd_info() — the grid specification and overview pyramid for a COG,
  returned as a data frame. Stable across all files sharing the same grid.
- tiff_refs() — the tile manifest: offset, length, spatial position, and
  metadata for every tile at a given overview level. Ephemeral, per-file, fast.
- tiff_read_tiles() — fetch and decode a selected set of tiles to arrays.
robstore wraps object_store directly. It exposes store constructors
(s3_store, http_store, local_store) and generic I/O operations
(ob_get_range, ob_get_ranges, ob_put, ob_list). It has no concept of
TIFF, COG, or any geospatial format. It is a generic cloud storage client.
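To make the shape of that surface concrete, here is a usage sketch. The function names are the ones listed above; the argument names, defaults, and return shapes are assumptions, since the package is alpha and its signatures may differ.

```
# Sketch of the robstore surface described above. Function names are from
# the package; argument names and return shapes are assumptions.
library(robstore)

store <- http_store("https://example.com")   # or s3_store(), local_store()

ob_list(store, prefix = "data/")             # enumerate objects under a prefix
bytes <- ob_get_range(store, "data/some.tif",
                      offset = 0, length = 16384)   # one byte span

# Vectorised ranges: one call, many byte spans (e.g. a batch of COG tiles)
chunks <- ob_get_ranges(store, "data/some.tif",
                        offsets = c(0, 65536),
                        lengths = c(16384, 16384))
```

The point of the vectorised ob_get_ranges is that batching happens below the R boundary: one call, one async fan-out in Rust.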
These two packages are independent — rustycogs does not depend on robstore. Both
sit directly on the same Rust crate. The dependency graph is flat:
object_store (Rust)
├── async-tiff (Rust) → rustycogs (R)
└────────────────────── robstore (R)
The key performance principle: the R boundary is expensive. Keep pipelines in
Rust; surface to R only what the user needs to inspect or manipulate. A single
block_on that fetches and decodes a batch of tiles is faster than
fetch-to-R-then-decode, even if the arithmetic is the same.
The tile index as the central primitive
The insight that unlocks the framework is this: the grid specification is
stable; the tile refs are ephemeral.
For a collection like the GHRSST SST archive — 8,600+ daily files, all on the
same grid — you scan the IFD structure once. It does not change. The overview
pyramid, the tile dimensions, the CRS, the geotransform: these are fixed
properties of the collection, not of individual files.
The tile refs — byte offsets and lengths — are per-file, but cheap to obtain
(the COG header is small) and trivially cacheable.
This separation makes the following workflow natural:
# 1. Scan once — stable, shared across the entire collection
spec <- tiff_ifd_info(any_ghrsst_url)
# 2. Plan in R — this is where the interesting work happens
ifd <- best_overview(spec, target_resolution) # pick overview level
refs <- tiff_refs(url_for_date(date), ifd = ifd) # scan this file
tiles <- filter_tiles(refs, extent) # spatial filter
# 3. Fetch — one vectorised range request
tiff_read_tiles(tiles) # decoded arrays, or:
ob_get_ranges(store, path, tiles$offset, tiles$length) # raw bytes
Step 2 is pure R on a data frame. Steps 1 and 3 are Rust. The boundary is clean
and the cost is proportional to actual work done.
crop() becomes a spatial filter on a data frame — no data movement, no network
traffic. plot() selects the appropriate overview level automatically — the
resolution pyramid is already in the file, decimation is not needed. collect()
is the explicit trigger for I/O. The lazy/collect pattern is not a terra idiom,
but it is an honest description of what is happening.
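The target idiom can be sketched in a few lines. None of these helper names exist yet; this is the shape of the interface being argued for, not an API.

```
# Sketch of the lazy/collect shape. All names here are hypothetical.
src <- lazy_cog_collection(spec, time_index)  # no I/O: just a plan over tile refs
sub <- crop(src, extent)    # data-frame filter on the manifest, no network traffic
plot(sub)                   # chooses an overview level, fetches only those tiles
x   <- collect(sub)         # the one explicit I/O trigger: fetch + decode
```

Every step before collect() (and the implicit fetch inside plot()) manipulates the tile manifest, so its cost is independent of the size of the underlying data.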
Implications for raadtools and collection-scale reading
The raadtools package currently contains a reader function per dataset — read_*
functions with bespoke file path logic, format-specific handling, and terra/raster
gymnastics accumulated over many years. This is not a criticism; it reflects what
was necessary when the underlying stack required it.
With COG-ified collections the reader code collapses. Every dataset that has been
converted to COG gets the same generic interface for free. The per-dataset
knowledge that remains is:
- The URL template (how to construct a file URL from a timestamp)
- The time index (a lookup table from date to URL)
- The grid spec (scanned once, cached)
The actual read path — overview selection, spatial filtering, tile fetch, decode,
mosaic — is identical across GHRSST, NSIDC, OISST, CCMP, AMSR, and any other
regular grid collection.
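Those three pieces of per-dataset knowledge compose into a single generic reader. The sketch below is hypothetical: read_collection() and best_overview()/filter_tiles() are illustrative helpers (the latter two appear in the workflow earlier), while tiff_refs() and tiff_read_tiles() are the rustycogs calls described above; the URL template format is an assumption.

```
# Hypothetical generic reader: the only per-dataset inputs are the URL
# template, the time-to-URL logic, and the cached grid spec.
read_collection <- function(collection, date, extent, target_res) {
  url  <- sprintf(collection$url_template,
                  format(date, "%Y%m%d"))     # assumed template shape
  ifd  <- best_overview(collection$spec, target_res)  # spec scanned once, cached
  refs <- tiff_refs(url, ifd = ifd)           # per-file tile manifest
  tiff_read_tiles(filter_tiles(refs, extent)) # fetch + decode the subset
}
```

Swapping GHRSST for OISST or CCMP changes only the collection object, never the read path.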
This is not a replacement for raadtools' domain knowledge — the time indexing,
the understanding of what each product represents, the quality flag handling.
It is a replacement for the mechanical I/O layer underneath, which can be generic
because COG is a generic format.
The result could be a raadtools-terra UX: familiar terra idioms (rast(),
crop(), plot()) backed by this machinery, lazy by default, resolution-aware
automatically, working identically whether the files are local, on S3, or behind
an HTTP server.
What GDAL remains essential for
This is not a story about GDAL being superseded. It is a story about GDAL being
used for what it is actually good at and not pressed into service for things it
was not designed for.
GDAL remains essential for:
- Format archaeology — HDF4, HDF-EOS, GRIB, ancient binary formats, vendor
  formats. No Rust crate is going to replicate this anytime soon, nor should it.
- Legacy file stores — indexing and virtualising what already exists, which
  is overwhelmingly not COG. VRT is underappreciated as a generic virtualisation
  layer for heterogeneous collections.
- The warp engine — reproject-and-resample is genuinely hard, well-tested in
  GDAL, and handles corner cases that would take years to reproduce correctly.
- Writing — COG creation, format conversion, the full write path for anything
  that isn't a cloud-native write.
The right framing is that GDAL should be modularised in how it is used — reach
for the warp engine when you need warping, the format driver when you have a
legacy format, the VRT layer when you need virtualisation. Stop using it as the
I/O layer for files specifically designed not to need it.
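In R that modular usage already has a natural home. The sketch below assumes gdalraster's buildVRT() and warp() wrappers; check the current signatures before relying on them, and the file names are placeholders.

```
# Sketch: reach for GDAL's modules directly, assuming gdalraster's
# buildVRT() and warp() (verify signatures; paths are placeholders).
library(gdalraster)

# Virtualise a heterogeneous legacy collection behind one VRT
buildVRT("archive.vrt", c("old1.nc", "old2.grib"))

# Invoke the warp engine only when reprojection is actually needed
warp("archive.vrt", "epsg3031.tif", t_srs = "EPSG:3031")
```

Nothing in that sketch touches GDAL for plain COG reads; it is GDAL as a toolbox of engines rather than as the mandatory I/O layer.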
gdalraster is doing this at the R level already — exposing GDAL's useful parts
more directly rather than hiding them behind higher-level abstractions. That
direction is correct and complementary to what rustycogs/robstore are doing.
The disruption is structural, not spectacular
The disruption this enables is not a new GIS platform or a dramatic announcement.
It is structural: the assumption that "geospatial I/O in R means GDAL" stops being
true for the growing class of data that lives in cloud-native formats.
The people who will see this coming are those already operating COG + STAC +
object storage and feeling the GDAL friction daily. The people who will not see it
coming are those who treat sf + terra as stable ground and do not look at what is
underneath.
The framework is not complete and the packages are alpha. But the key architectural
decisions are right:
- object_store as the I/O primitive — cloud-agnostic, async, fast
- async-tiff as the TIFF/COG layer — format-specific, decode-capable
- tile refs as the central planning primitive — separating index from fetch
- R as the coordination layer — spatial filtering, time indexing, user interface
- Rust for everything performance-sensitive — minimal R boundary crossings
The raadtools collection as a COG stack with a generic lazy reader is the proof of
concept worth building. If it works for GHRSST it works for everything.
rustycogs and robstore are available on r-universe at
trackage.r-universe.dev. Both are alpha. Feedback welcome.