I have a lot of data stored in NetCDF files, following a specific hierarchy, and I often write code similar to:

```python
ds = xr.open_dataset(<path>, engine="h5netcdf")
ds = prep_custom_format(ds)
```

This processing step sets some attributes and lazily updates various data variables. (This could be as simple as rescaling the values.) I'd like to get rid of this boilerplate and have a custom backend entrypoint such that I could write:

```python
ds = xr.open_dataset(<path>, engine="my-custom-engine")
```

Following the well-written development guide, I was almost able to do just that. The following class was defined and registered as a backend entrypoint. (This code has been adapted to make it shorter. In reality, I allow all arguments that the `h5netcdf` backend accepts.)

```python
import xarray as xr
from xarray.backends import BackendEntrypoint
from xarray.backends.h5netcdf_ import H5netcdfBackendEntrypoint


def prep_custom_format(ds: xr.Dataset) -> xr.Dataset:
    ...


class MyBackendEntrypoint(BackendEntrypoint):
    def open_dataset(self, filename_or_obj):
        ds = H5netcdfBackendEntrypoint().open_dataset(filename_or_obj)
        ds = prep_custom_format(ds)
        return ds
```

The custom engine has been registered correctly and works as expected, except that it does not work lazily. Even if the pre-processing step is removed, my RAM usage surges when opening (not loading) a dataset. This was quite surprising to me. I'd like to know if a backend can lazily pre-process data, or if this approach is a dead-end.
Replies: 2 comments
---
While I would advise against using custom backends in this case, for the reason that it is fragile (and unnecessarily limiting to a single on-disk format / engine), note that it is technically possible to do this by making use of private API.

The way the backends work is that the backend array type (a type that can fetch bytes from disk and convert to a `numpy` array) is wrapped in an internal lazy array type (`LazilyIndexedArray`), and when decoding, multiple functions are applied to it using `lazy_elemwise_func`, which takes care not to load the data into memory until told to (for example, when accessing `.data`). Being private API, however, these functions can be moved around or changed without prior notice.

There are some proposals to allow writing custom variable and dataset coders (which would be the hook you're looking for), but this doesn't exist for now.

I personally use the first pattern for this, but slightly rewritten to:

```python
ds = xr.open_dataset(...).pipe(prep_custom_format)
```

or, with a custom accessor:

```python
import my_accessor_module

ds = xr.open_dataset(...).my_accessor.decode()
```
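To make the accessor variant concrete, here is a minimal, self-contained sketch using `xr.register_dataset_accessor`; the accessor name, variable name, and rescaling step are all hypothetical stand-ins for `prep_custom_format`:

```python
import numpy as np
import xarray as xr


# Hypothetical accessor illustrating the pattern; registering it makes
# `.my_accessor` available on every Dataset after the module is imported.
@xr.register_dataset_accessor("my_accessor")
class MyFormatAccessor:
    def __init__(self, ds: xr.Dataset):
        self._ds = ds

    def decode(self) -> xr.Dataset:
        # Example pre-processing: rescale a variable and tag an attribute.
        out = self._ds.assign(temperature=self._ds["temperature"] * 0.01)
        out.attrs["decoded"] = "true"
        return out


ds = xr.Dataset({"temperature": ("x", np.array([1000.0, 2000.0]))})
decoded = ds.my_accessor.decode()
print(decoded["temperature"].values)  # [10. 20.]
```

Note that `decode()` here returns a new Dataset rather than mutating in place, so the same pattern stays lazy when the underlying variables are backed by lazy arrays.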
---
Thanks for your insight!
I found #68 and #8548. Do you know of any other issues, PRs, etc. that I can subscribe to?