I have a lot of data stored in NetCDF files, following a specific hierarchy, and I often write code similar to:

```python
ds = xr.open_dataset(<path>, engine="h5netcdf")
ds = prep_custom_format(ds)
```

This processing step sets some attributes and lazily updates various data variables. (This could be as simple as rescaling the values.) I'd like to get rid of this boilerplate and have a custom backend entrypoint such that I could write:

```python
ds = xr.open_dataset(<path>, engine="my-custom-engine")
```

Following the well-written development guide, I was almost able to do just that. The following class was defined and registered as a backend entrypoint. (This code has been adapted to make it shorter. In reality, I allow all arguments that the `h5netcdf` backend accepts.)

```python
import xarray as xr
from xarray.backends import BackendEntrypoint
from xarray.backends.h5netcdf_ import H5netcdfBackendEntrypoint


def prep_custom_format(ds: xr.Dataset) -> xr.Dataset:
    ...


class MyBackendEntrypoint(BackendEntrypoint):
    def open_dataset(self, filename_or_obj):
        ds = H5netcdfBackendEntrypoint().open_dataset(filename_or_obj)
        ds = prep_custom_format(ds)
        return ds
```

The custom engine has been registered correctly and works as expected, except that it does not work lazily. Even if the pre-processing step is removed, my RAM usage surges when opening (not loading) a dataset. This was quite surprising to me. I'd like to know if a backend can lazily pre-process data, or if this approach is a dead-end.
Replies: 2 comments
---
While I would advise against using custom backends in this case, for the reason that it is fragile (and unnecessarily limiting to a single on-disk format / engine), note that it is technically possible to do this by making use of private API.

The way the backends work is that the backend array type (a type that can fetch bytes from disk and convert to a `numpy` array) is wrapped in an internal lazy array type (`LazilyIndexedArray`), and when decoding, multiple functions are applied to it using `lazy_elemwise_func`, which takes care not to load the data into memory until told to (for example, when accessing `.data`). Being private API, however, these functions can be moved around or changed without prior notice.

There are some proposals to allow writing custom variable and dataset coders (which would be the hook you're looking for), but this doesn't exist for now.

I personally use the first pattern for this, but slightly rewritten to:

```python
ds = xr.open_dataset(...).pipe(prep_custom_format)
```

or, with a custom accessor:

```python
import my_accessor_module

ds = xr.open_dataset(...).my_accessor.decode()
```
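To make the accessor variant concrete, here is a minimal, self-contained sketch using `xr.register_dataset_accessor`; the accessor name, variable name, and rescaling step are all hypothetical stand-ins for `prep_custom_format`:

```python
import numpy as np
import xarray as xr


# Hypothetical accessor illustrating the pattern; registering it makes
# `.my_accessor` available on every Dataset after the module is imported.
@xr.register_dataset_accessor("my_accessor")
class MyFormatAccessor:
    def __init__(self, ds: xr.Dataset):
        self._ds = ds

    def decode(self) -> xr.Dataset:
        # Example pre-processing: rescale a variable and tag an attribute.
        out = self._ds.assign(temperature=self._ds["temperature"] * 0.01)
        out.attrs["decoded"] = "true"
        return out


ds = xr.Dataset({"temperature": ("x", np.array([1000.0, 2000.0]))})
decoded = ds.my_accessor.decode()
print(decoded["temperature"].values)  # [10. 20.]
```

Note that `decode()` here returns a new Dataset rather than mutating in place, so the same pattern stays lazy when the underlying variables are backed by lazy arrays.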
---
Thanks for your insight!
I found #68 and #8548. Do you know of any other issues, PRs, etc. that I can subscribe to?