NetCDF4 files, commonly used for storing climate and Earth-system data, are not optimized for machine learning applications with heavy I/O requirements, or for datasets that are simply too large to hold in GPU/CPU memory.
nc2pt performs a preprocessing flow on climate fields and converts them from NetCDF4 (`.nc`) to an intermediate file format, Zarr (`.zarr`), which allows for parallel loading and writing of individual PyTorch files (`.pt`) that can be loaded directly onto GPUs.
- standardizes and makes metadata uniform between datasets
- aligns different grids by re-projecting them onto one another: nc2pt projects the low-resolution (lr) regular grids onto high-resolution (hr) curvilinear grids. This step can be configured to suit specific datasets.
- selects individual years as test years or training years
- organizes code into input (lr) or output (hr) fields
- designed for O(terabyte) datasets
- configures metadata between the datasets as defined in the config
- slices data to a pre-determined range of dates
- aligns the grids via interpolation, crops them to be the same size, and coarsens the low-resolution fields by the configured scale factor
- applies user-defined transforms like unit conversions or log transformations
- optionally splits into train/test/validation based on years defined in the config
- standardizes/normalizes the datasets based on statistics computed from the training data (or full data, if no split). These statistics are stored in the metadata.
- writes to `.zarr`
- `nc2pt/tools/zarr_to_torch.py` - writes to PyTorch files
- `nc2pt/tools/single_file_to_batches.py` - batches the single PyTorch files
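As an illustration, the coarsening step amounts to a block mean over non-overlapping windows of the configured scale factor. This is a minimal numpy sketch of the idea, not nc2pt's implementation (which operates on chunked xarray datasets); the function name is made up:

```python
import numpy as np

def coarsen_block_mean(field, factor):
    """Coarsen a 2D field by averaging non-overlapping factor x factor blocks.

    Assumes the grid dimensions are divisible by `factor`; real datasets
    may need cropping first (as in the spatial_crop step).
    """
    ny, nx = field.shape
    assert ny % factor == 0 and nx % factor == 0
    return field.reshape(ny // factor, factor, nx // factor, factor).mean(axis=(1, 3))

# Example: an 8x8 high-resolution field coarsened by a factor of 4 becomes 2x2.
hr = np.arange(64, dtype=float).reshape(8, 8)
lr = coarsen_block_mean(hr, 4)
print(lr.shape)  # (2, 2)
```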
When preprocessing with nc2pt, precipitation (pr) is handled differently:
- A log transform is automatically applied to precipitation values, with a small constant ϵ = 10⁻³:

  `scaled = (log(P + ϵ) − log(ϵ)) / (log(max(P) + ϵ) − log(ϵ))`

- This normalization is applied to the `pr` variable alone; other variables use standard normalization.
- A log transform for `pr` should not be specified in the config, as the data would then be log-transformed twice.
- A log message will inform you each time this special scaling is performed.
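The scaling above can be written out directly. This is a numpy sketch of the formula as described, with the function name chosen for illustration rather than taken from nc2pt's API:

```python
import numpy as np

EPS = 1e-3  # the small constant epsilon from the formula above

def scale_precip(P):
    """Log-transform precipitation and normalize it to the [0, 1] range."""
    num = np.log(P + EPS) - np.log(EPS)
    den = np.log(P.max() + EPS) - np.log(EPS)
    return num / den

P = np.array([0.0, 1.0, 10.0])
scaled = scale_precip(P)
print(scaled)  # zero precipitation maps to 0.0, the maximum maps to 1.0
```

The log compresses the heavy right tail of precipitation, and the denominator rescales the result so the training maximum lands at 1.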
Each model can define its own custom preprocessing steps by listing them in order via the alignment_pipeline field of the model YAML. Steps include:
- `temporal_crop`
- `regrid`
- `spatial_crop`
- `coarsen`
- `user_defined_transforms`
- `data_split`
By default, all 6 are applied. You can exclude or reorder them by editing the YAML, e.g.:
```yaml
alignment_pipeline:
  - temporal_crop
  - regrid
  - spatial_crop
```
The most obvious downside of the `.pt` format is that you lose the metadata associated with a netCDF dataset. The intermediate Zarr format produced by nc2pt allows for parallelized I/O and preserves the metadata, which is useful for inference.
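Because the training statistics travel with the metadata, inference code can un-standardize model output later. Here is a minimal sketch of that round trip, with plain numpy and a dict standing in for the stored Zarr attributes (names are illustrative, not nc2pt's internals):

```python
import numpy as np

def standardize(train, full):
    """Standardize using statistics computed from the training split only."""
    stats = {"mean": float(train.mean()), "std": float(train.std())}
    scaled = (full - stats["mean"]) / stats["std"]
    return scaled, stats  # stats would be stored in the Zarr metadata

def unstandardize(scaled, stats):
    """Invert the standardization, e.g. for inference-time output."""
    return scaled * stats["std"] + stats["mean"]

train = np.random.default_rng(0).normal(5.0, 2.0, size=1000)
scaled, stats = standardize(train, train)
restored = unstandardize(scaled, stats)
```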
- Python >= 3.8
- Recommended: virtual environment (e.g. `venv` or `virtualenv`)
- Clone this repository:

  ```shell
  git clone https://github.com/climagination/nc2pt.git
  cd nc2pt
  ```

- (Optional but recommended) Create and activate a virtual environment:

  ```shell
  python -m venv .venv
  source .venv/bin/activate
  ```

- Install dependencies:

  ```shell
  pip install -r requirements.txt
  ```

- Install the package in editable mode:

  ```shell
  pip install -e nc2pt/
  ```
That’s it!
nc2pt uses Hydra for flexible configuration. The main configuration file is conf/config.yaml, which defines:
- The list of climate models to include (`climate_models`)
- Global dimensions, coordinates, subsetting, and chunking
- Output path and compute options
Each model and variable is defined in separate YAML files under conf/climate_models/, making the pipeline modular and easily extensible.
To add a new model:
- Create a model file. Place it at `conf/climate_models/<name>/model.yaml`. Example:

  ```yaml
  _target_: nc2pt.climatedata.ClimateModel
  name: my_model
  info: "My custom climate model"
  alignment_pipeline:
    - temporal_crop
    - regrid
    - spatial_crop
    - coarsen
    - user_defined_transforms
    - data_split
  climate_variables:
    - ${internal.my_model_pr}
    - ${internal.my_model_tas}
  ```

- Register it in `injections.yaml`:

  ```yaml
  defaults:
    - climate_models/my_model@internal._my_model
    - climate_models/my_model/pr@internal._my_model_pr
    - climate_models/my_model/tas@internal._my_model_tas
  ```

- Expose the aliases in `injections.yaml`:

  ```yaml
  internal:
    my_model: ${internal._my_model}
    my_model_pr: ${internal._my_model_pr}
    my_model_tas: ${internal._my_model_tas}
  ```

- Enable it in `config.yaml`:

  ```yaml
  climate_models:
    - ${internal.my_model}
  ```
To add a new variable to an existing model (e.g., hr):
- Create a variable file. Place it at `conf/climate_models/hr/zg.yaml`:

  ```yaml
  _target_: nc2pt.climatedata.ClimateVariable
  name: "zg"
  alternative_names: ["zg"]
  path: ${internal.paths.hr.zg}
  apply_standardize: true
  apply_normalize: true
  invariant: false
  transform: ["x * 69 + 420"]
  ```

- Register and alias it in `injections.yaml`:

  ```yaml
  defaults:
    - climate_models/hr/zg@internal._hr_zg

  internal:
    hr_zg: ${internal._hr_zg}
  ```

- Add it to the model's variable list in `conf/climate_models/hr.yaml`:

  ```yaml
  climate_variables:
    - ${internal.hr_zg}
    # other variables...
  ```
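A `transform` entry holds expressions in the variable `x`, like the example `"x * 69 + 420"` above. One way such strings can be applied is a restricted `eval` over the array; this is an illustrative sketch of the idea, not necessarily how nc2pt evaluates them:

```python
import numpy as np

def apply_transforms(x, transforms):
    """Apply user-defined expressions in sequence; each sees the current result as `x`.

    Builtins are stripped from eval to keep the sketch contained; only the
    array and numpy are visible to the expression.
    """
    for expr in transforms:
        x = eval(expr, {"__builtins__": {}}, {"x": x, "np": np})
    return x

zg = np.array([0.0, 1.0])
out = apply_transforms(zg, ["x * 69 + 420"])
print(out)  # [420. 489.]
```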
That’s it — your new model or variable will now be included in the pipeline when preprocess.py is run.
- Explore data and ensure compatibility
- Set up your configuration:
  - Edit `conf/config.yaml` to include the models you want to use under `climate_models:`
  - For each model, go to its `model.yaml` file and uncomment (or add) the variables you want included
- Run the `nc2pt/preprocess.py` script, which will run through your preprocessing steps. This creates the Zarr files.
- Run the `nc2pt/tools/zarr_to_torch.py` script, which serializes each time step in the `.zarr` file to an individual PyTorch `.pt` file.
- Optional: run `nc2pt/tools/single_files_to_batches.py`, which combines individual files from the previous step into random batches. This setup allows for less I/O in your machine learning pipeline.
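The batching step can be pictured as shuffling the single-sample files and grouping them into fixed-size batches. A stdlib-only sketch of that idea (file names, batch size, and remainder handling are illustrative assumptions, not the tool's exact behavior):

```python
import random

def make_batches(files, batch_size, seed=0):
    """Shuffle sample files and group them into fixed-size random batches.

    Leftover files that do not fill a batch are dropped here, one common
    choice; the real tool may handle the remainder differently.
    """
    files = list(files)
    random.Random(seed).shuffle(files)  # deterministic shuffle for the sketch
    n_full = len(files) // batch_size
    return [files[i * batch_size:(i + 1) * batch_size] for i in range(n_full)]

files = [f"sample_{i:05d}.pt" for i in range(10)]
batches = make_batches(files, batch_size=4)
print(len(batches))  # 2 full batches of 4; the 2 leftover files are dropped
```

Reading one pre-batched file per training step replaces many small random reads with a single sequential one, which is where the I/O savings come from.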
Testing is done with pytest. The easiest way to run the tests is to install pytest and use:

```shell
pytest --cov-report term-missing --cov=nc2pt .
```

This will generate a coverage report and automatically collect files matching `test_*.py` in `nc2pt/tests`.
- **Chunking Sensitivity:** The preprocessing pipeline is sensitive to how datasets are chunked in memory. If you encounter memory errors or Dask worker crashes, reviewing and adjusting the chunk sizes is a good first step. See closed issue #18 for details and suggestions.
- **Interpolation Method:** The current interpolation method uses xarray's native 2D interpolation, which does not account for Earth curvature. This repository previously used an xESMF-backed interpolation scheme that performed regridding on spherical geometry. However, within the scope of this work, the difference in performance was found to be negligible, so the dependency on xESMF was removed. See closed issue #15 for more context.
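To make the tradeoff concrete: planar regridding amounts to bilinear interpolation in the grid's own coordinates, treating longitude and latitude as flat Cartesian axes. A self-contained numpy sketch of that idea (an illustration of planar 2D interpolation in general, not nc2pt's code, which delegates to xarray):

```python
import numpy as np

def bilinear(xg, yg, f, xq, yq):
    """Bilinear interpolation of f (indexed f[y, x]) at a single query point.

    Treats the coordinate axes as flat Cartesian, i.e. ignores Earth
    curvature, matching the behavior of planar 2D interpolation.
    """
    i = np.clip(np.searchsorted(xg, xq) - 1, 0, len(xg) - 2)
    j = np.clip(np.searchsorted(yg, yq) - 1, 0, len(yg) - 2)
    tx = (xq - xg[i]) / (xg[i + 1] - xg[i])
    ty = (yq - yg[j]) / (yg[j + 1] - yg[j])
    return ((1 - tx) * (1 - ty) * f[j, i] + tx * (1 - ty) * f[j, i + 1]
            + (1 - tx) * ty * f[j + 1, i] + tx * ty * f[j + 1, i + 1])

xg = yg = np.array([0.0, 1.0, 2.0])
f = xg[None, :] + yg[:, None]        # a plane: f(x, y) = x + y
val = bilinear(xg, yg, f, 0.5, 1.5)
print(val)  # 2.0, exact for a linear field
```

On a spherical-geometry regridder (like the removed xESMF path), weights would instead be computed from great-circle distances; at typical climate-grid resolutions the two agree closely, which is consistent with the negligible difference reported above.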

