
discussion doc #2

@mdsumner


Cloud-native geospatial I/O in R: historical context and the opportunity ahead

A discussion note — March 2026


Where we came from

For most of the history of scientific computing with spatial data, GDAL has been
the indispensable layer. It is not merely a library — it is the accumulated
product of decades of reverse-engineering, edge-case handling, and format
archaeology across hundreds of raster and vector formats. HDF-EOS, GRIB, obscure
military formats, every NetCDF dialect anyone ever deployed — the knowledge
embedded in GDAL drivers is irreplaceable and was not easily won. The R spatial
ecosystem (now most directly via gdalraster) is explicit about this dependency.

In Python the picture got muddier. Rasterio presented itself as "GDAL with a
better API," and the framing stuck. For a generation of Python geospatial
practitioners, GDAL became invisible infrastructure — something conda sorted out —
and rasterio became the thing they actually understood. The osgeo.gdal/ogr/osr
community, building directly on the C API through conda rather than pypi, retained
a clearer model of what GDAL actually is. But the two communities diverged, and
GDAL's value got attributed to its wrapper while its friction got attributed to
itself.

This matters because it shaped what people believed could be worked around.


What changed: cloud-native formats and async I/O

The Cloud-Optimised GeoTIFF (COG) was designed as a GDAL-compatible format. The
community that built it did not fully reckon with an implication: a sufficiently
correct TIFF reader makes GDAL optional for this format. COG's defining property —
the tile index in the header, tile data at known byte offsets — means that if you
can read the header and issue HTTP range requests, you have everything you need.
No driver architecture required. No VSI layer. No linking against a C library.

The Rust ecosystem then provided two pieces that make this practical:

  • object_store (Apache arrow-rs) — a unified async I/O layer for S3, GCS,
    Azure, HTTP, and local filesystems, used by Delta Lake, DataFusion, and the
    broader Arrow ecosystem. Fast, correct, zero geospatial assumptions.

  • async-tiff — a TIFF metadata and tile fetch/decode library built on
    object_store. It reads the IFD structure, exposes the overview pyramid, and
    fetches and decodes tiles. It knows about COG layout, compression codecs, and
    data types. It does not know about GDAL.

Kyle Barron (Development Seed) recognised this and built obstore — direct
Python bindings to object_store, the same storage layer rasterio uses internally
but exposed directly. The point is not to replace rasterio but to be honest about
which layer is doing what.


The R opportunity

Two new R packages sit on this foundation:

rustycogs wraps async-tiff. It exposes:

  • tiff_ifd_info() — the grid specification and overview pyramid for a COG,
    returned as a data frame. Stable across all files sharing the same grid.
  • tiff_refs() — the tile manifest: offset, length, spatial position, and
    metadata for every tile at a given overview level. Ephemeral, per-file, fast.
  • tiff_read_tiles() — fetch and decode a selected set of tiles to arrays.
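A minimal sketch of how these three functions compose (the packages are alpha; argument names beyond those listed above, and the URL, are assumptions):

```r
# Sketch only: exact signatures are assumptions against an alpha API.
library(rustycogs)

url <- "https://example.com/some-cog.tif"   # hypothetical COG URL

spec <- tiff_ifd_info(url)       # data frame: one row per overview level
refs <- tiff_refs(url, ifd = 1)  # tile manifest: offset, length, position

# Fetch and decode a small selection of tiles to arrays
tiles <- tiff_read_tiles(refs[1:4, ])
```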

robstore wraps object_store directly. It exposes store constructors
(s3_store, http_store, local_store) and generic I/O operations
(ob_get_range, ob_get_ranges, ob_put, ob_list). It has no concept of
TIFF, COG, or any geospatial format. It is a generic cloud storage client.
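The same shape at the storage layer, with no format knowledge involved (again a sketch; argument names are assumptions against an alpha API):

```r
# Sketch only: a generic byte-range read, no geospatial semantics.
library(robstore)

st <- http_store("https://example.com")   # store constructors: s3_store,
                                          # http_store, local_store

ob_list(st, "data/")                              # enumerate objects
bytes <- ob_get_range(st, "data/some-cog.tif",    # read an arbitrary
                      offset = 0, length = 16384) # byte range
```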

These two packages are independent — rustycogs does not depend on robstore. Both
sit directly on the same Rust crate. The dependency graph is flat:

object_store (Rust)
    ├── async-tiff (Rust)  →  rustycogs (R)
    └──────────────────────  robstore (R)

The key performance principle: the R boundary is expensive. Keep pipelines in
Rust; surface to R only what the user needs to inspect or manipulate. A single
block_on that fetches and decodes a batch of tiles is faster than
fetch-to-R-then-decode, even if the arithmetic is the same.
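The contrast looks like this (a fragment, assuming `tiles`, `st`, and `path` from the surrounding discussion; signatures are assumptions):

```r
# Slow: one R boundary crossing, and one request, per tile
for (i in seq_len(nrow(tiles))) {
  raw <- ob_get_range(st, path,
                      offset = tiles$offset[i],
                      length = tiles$length[i])
  # ...decode each tile in R...
}

# Fast: one vectorised call; fetch and decode both stay in Rust
arrays <- tiff_read_tiles(tiles)
```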


The tile index as the central primitive

The insight that unlocks the framework is this: the grid specification is
stable; the tile refs are ephemeral.

For a collection like the GHRSST SST archive — 8,600+ daily files, all on the
same grid — you scan the IFD structure once. It does not change. The overview
pyramid, the tile dimensions, the CRS, the geotransform: these are fixed
properties of the collection, not of individual files.

The tile refs — byte offsets and lengths — are per-file, but cheap to obtain
(the COG header is small) and trivially cacheable.

This separation makes the following workflow natural:

# 1. Scan once — stable, shared across the entire collection
spec <- tiff_ifd_info(any_ghrsst_url)

# 2. Plan in R — this is where the interesting work happens
#    (best_overview, url_for_date, filter_tiles are sketch helpers, not exports)
ifd   <- best_overview(spec, target_resolution)    # pick overview level
refs  <- tiff_refs(url_for_date(date), ifd = ifd)  # scan this file
tiles <- filter_tiles(refs, extent)                # spatial filter

# 3. Fetch — one vectorised range request
tiff_read_tiles(tiles)   # decoded arrays, or:
ob_get_ranges(store, path, tiles$offset, tiles$length)  # raw bytes

Step 2 is pure R on a data frame. Steps 1 and 3 are Rust. The boundary is clean
and the cost is proportional to actual work done.

crop() becomes a spatial filter on a data frame — no data movement, no network
traffic. plot() selects the appropriate overview level automatically — the
resolution pyramid is already in the file, decimation is not needed. collect()
is the explicit trigger for I/O. The lazy/collect pattern is not a terra idiom,
but it is the honest description of what is happening.
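Put together, the user-facing shape could be as simple as this (all three wrappers here — `lazy_cog()`, `crop()`, `collect()` — are illustrative names for the pattern, not exported functions):

```r
# Sketch of the lazy/collect shape; every name here is hypothetical.
x <- lazy_cog(url)       # scans the header only: grid spec + tile refs
x <- crop(x, extent)     # filters the tile manifest; no network traffic
r <- collect(x)          # the only step that performs I/O and decoding
```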


Implications for raadtools and collection-scale reading

The raadtools package currently contains a reader function per dataset — read_*
functions with bespoke file path logic, format-specific handling, and terra/raster
gymnastics accumulated over many years. This is not a criticism; it reflects what
was necessary when the underlying stack required it.

With COG-ified collections the reader code collapses. Every dataset that has been
converted to COG gets the same generic interface for free. The per-dataset
knowledge that remains is:

  • The URL template (how to construct a file URL from a timestamp)
  • The time index (a lookup table from date to URL)
  • The grid spec (scanned once, cached)
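That residual per-dataset knowledge is small enough to be plain data. A sketch, with all names and the URL template purely illustrative:

```r
# Sketch: per-dataset knowledge reduced to a list (names are illustrative).
dates <- seq(as.Date("2002-06-01"), by = "1 day", length.out = 5)

ghrsst <- list(
  # URL template: timestamp -> file URL (sprintf and format are vectorised)
  url_for_date = function(d)
    sprintf("https://example.com/ghrsst/%s.tif", format(d, "%Y%m%d")),
  # time index: lookup table from date to URL
  index = data.frame(date = dates)
)
ghrsst$index$url <- ghrsst$url_for_date(dates)

# grid spec: scanned once from any file in the collection, then cached
# ghrsst$spec <- tiff_ifd_info(ghrsst$index$url[1])
```

Everything downstream of this — overview selection, spatial filtering, tile fetch, decode, mosaic — is the generic path described above.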

The actual read path — overview selection, spatial filtering, tile fetch, decode,
mosaic — is identical across GHRSST, NSIDC, OISST, CCMP, AMSR, and any other
regular grid collection.

This is not a replacement for raadtools' domain knowledge — the time indexing,
the understanding of what each product represents, the quality flag handling.
It is a replacement for the mechanical I/O layer underneath, which can be generic
because COG is a generic format.

The result could be a raadtools-terra UX: familiar terra idioms (rast(),
crop(), plot()) backed by this machinery, lazy by default, resolution-aware
automatically, working identically whether the files are local, on S3, or behind
an HTTP server.


What GDAL remains essential for

This is not a story about GDAL being superseded. It is a story about GDAL being
used for what it is actually good at and not pressed into service for things it
was not designed for.

GDAL remains essential for:

  • Format archaeology — HDF4, HDF-EOS, GRIB, ancient binary formats, vendor
    formats. No Rust crate is going to replicate this anytime soon, nor should it.
  • Legacy file stores — indexing and virtualising what already exists, which
    is overwhelmingly not COG. VRT is underappreciated as a generic virtualisation
    layer for heterogeneous collections.
  • The warp engine — reproject-and-resample is genuinely hard, well-tested in
    GDAL, and handles corner cases that would take years to reproduce correctly.
  • Writing — COG creation, format conversion, the full write path for anything
    that isn't a cloud-native write.

The right framing is that GDAL should be modularised in how it is used — reach
for the warp engine when you need warping, the format driver when you have a
legacy format, the VRT layer when you need virtualisation. Stop using it as the
I/O layer for files specifically designed not to need it.

gdalraster is doing this at the R level already — exposing GDAL's useful parts
more directly rather than hiding them behind higher-level abstractions. That
direction is correct and complementary to what rustycogs/robstore are doing.


The disruption is structural, not spectacular

The disruption this enables is not a new GIS platform or a dramatic announcement.
It is structural: the assumption that "geospatial I/O in R means GDAL" stops being
true for the growing class of data that lives in cloud-native formats.

The people who will see this coming are those already operating COG + STAC +
object storage and feeling the GDAL friction daily. The people who will not see it
coming are those who treat sf + terra as stable ground and do not look at what is
underneath.

The framework is not complete and the packages are alpha. But the key architectural
decisions are right:

  • object_store as the I/O primitive — cloud-agnostic, async, fast
  • async-tiff as the TIFF/COG layer — format-specific, decode-capable
  • tile refs as the central planning primitive — separating index from fetch
  • R as the coordination layer — spatial filtering, time indexing, user interface
  • Rust for everything performance-sensitive — minimal R boundary crossings

The raadtools collection as a COG stack with a generic lazy reader is the proof of
concept worth building. If it works for GHRSST it works for everything.


rustycogs and robstore are available on r-universe at
trackage.r-universe.dev. Both are alpha. Feedback welcome.
