
climagination/nc2pt


example workflow codecov Code style: black License: GPL v3


The Problem

NetCDF4 files, commonly used for storing climate and Earth-systems data, are not optimized for machine learning applications with heavy I/O requirements, or for datasets that are simply too large to hold in GPU/CPU memory.

How does nc2pt help?

nc2pt runs a preprocessing pipeline on climate fields and converts them from NetCDF4 (.nc) to an intermediate file format, Zarr (.zarr), which allows for parallel loading and writing of individual PyTorch files (.pt) that can be loaded directly onto GPUs.

What are the intended use cases of nc2pt?

  • standardizes and unifies metadata between datasets

  • aligns different grids by re-projecting them onto one another -- nc2pt projects the low-resolution (lr) regular grids onto high-resolution (hr) curvilinear grids. This step can be configured to suit specific datasets.

  • selects individual years as test years or training years

  • organizes code into input (lr) and output (hr) fields

  • handles O(terabyte) datasets

What preprocessing steps does nc2pt do? 🤔

High-level workflow image

  1. harmonizes metadata between the datasets as defined in the config
  2. slices data to a pre-determined range of dates
  3. aligns the grids via interpolation, crops them to the same size, and coarsens the low-resolution fields by the configured scale factor
  4. applies user-defined transforms such as unit conversions or log transformations
  5. optionally splits into train/test/validation based on years defined in the config
  6. standardizes/normalizes the datasets based on statistics computed from the training data (or the full data, if no split); these statistics are stored in the metadata
  7. writes to .zarr
  8. nc2pt/tools/zarr_to_torch.py - writes to PyTorch files
  9. nc2pt/tools/single_file_to_batches.py - batches the single PyTorch files
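
The coarsening in step 3 can be pictured as simple block averaging. A minimal NumPy sketch (illustrative only -- the array shapes, scale factor, and function name are assumptions, not nc2pt's actual implementation):

```python
import numpy as np

def coarsen(field: np.ndarray, scale: int) -> np.ndarray:
    """Block-average a 2D field by an integer scale factor."""
    h, w = field.shape
    assert h % scale == 0 and w % scale == 0, "field must tile evenly"
    # Group pixels into scale x scale blocks, then average each block.
    return field.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))

hr = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 high-resolution field
lr = coarsen(hr, scale=2)                      # 2x2 coarsened field
```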

Special Scaling for Precipitation

When preprocessing with nc2pt, precipitation (pr) is handled differently:

  • A log-transform is automatically applied to precipitation values, with a small constant ϵ = 10**-3: scaled = (log(P + ϵ) - log(ϵ)) / (log(max(P) + ϵ) - log(ϵ))

  • This normalization is applied to the pr variable alone; other variables use standard normalization.

  • Do not specify an additional log-transform for pr, as that would apply the log transform twice.

  • A log message will inform you each time this special scaling is performed.
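
The scaling above can be sketched in plain Python (the function and variable names here are illustrative, not nc2pt's internals):

```python
import math

EPS = 1e-3  # small constant keeping the log defined at zero precipitation

def scale_precip(p: float, p_max: float, eps: float = EPS) -> float:
    """Map precipitation into [0, 1]:
    (log(P + eps) - log(eps)) / (log(max(P) + eps) - log(eps))."""
    return (math.log(p + eps) - math.log(eps)) / (
        math.log(p_max + eps) - math.log(eps)
    )

# Zero precipitation maps to 0; the dataset maximum maps to 1.
print(scale_precip(0.0, p_max=50.0))   # 0.0
print(scale_precip(50.0, p_max=50.0))  # 1.0
```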

Customizable Pipelines 🚦

Each model can define its own custom preprocessing steps by listing them in order via the alignment_pipeline field of the model YAML. Steps include:

  • temporal_crop

  • regrid

  • spatial_crop

  • coarsen

  • user_defined_transforms

  • data_split

By default, all 6 are applied. You can exclude or reorder them by editing the YAML, e.g.:

alignment_pipeline:
  - temporal_crop
  - regrid
  - spatial_crop

What are the downsides of using PyTorch files for climate data?

The most obvious downside is that you lose the metadata associated with a NetCDF dataset. The intermediate Zarr format produced by nc2pt allows for parallelized I/O and preserves the metadata, which is useful at inference time.

Requirements

  • Python >= 3.8
  • Recommended: virtual environment (e.g. venv or virtualenv)

💽 Installation

  1. Clone this repository:

    git clone https://github.com/climagination/nc2pt.git
    cd nc2pt
  2. (Optional but recommended) Create and activate a virtual environment:

    python -m venv .venv
    source .venv/bin/activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Install the package in editable mode:

    pip install -e nc2pt/

That’s it!

📋 Configuration

nc2pt uses Hydra for flexible configuration. The main configuration file is conf/config.yaml, which defines:

  • The list of climate models to include (climate_models)

  • Global dimensions, coordinates, subsetting, and chunking

  • Output path and compute options

Each model and variable is defined in separate YAML files under conf/climate_models/, making the pipeline modular and easily extensible.
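
A minimal sketch of what conf/config.yaml might contain (key names other than climate_models are assumptions -- check the shipped config for the exact schema):

    # Illustrative only; aliases resolve via injections.yaml.
    climate_models:
      - ${internal.hr}
      - ${internal.lr}
    output_path: /path/to/output  # where the .zarr stores are written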


➕ Adding a New Climate Model

To add a new model:

  1. Create a model file
    Place it at conf/climate_models/<name>/model.yaml. Example:

    _target_:  nc2pt.climatedata.ClimateModel
    name:  my_model
    info:  "My custom climate model"
    alignment_pipeline:
      - temporal_crop
      - regrid
      - spatial_crop
      - coarsen
      - user_defined_transforms
      - data_split
    climate_variables:
        -  ${internal.my_model_pr}
        -  ${internal.my_model_tas}
  2. Register it in injections.yaml

    defaults:
      -  climate_models/my_model@internal._my_model
      -  climate_models/my_model/pr@internal._my_model_pr
      -  climate_models/my_model/tas@internal._my_model_tas
  3. Expose the aliases in injections.yaml

    internal:
      my_model:  ${internal._my_model}
      my_model_pr:  ${internal._my_model_pr}
      my_model_tas:  ${internal._my_model_tas}
  4. Enable it in config.yaml

    climate_models:
       -  ${internal.my_model}

➕ Adding a New Climate Variable

To add a new variable to an existing model (e.g., hr):

  1. Create a variable file
    Place it at conf/climate_models/hr/zg.yaml:

     _target_:  nc2pt.climatedata.ClimateVariable
     name:  "zg"
     alternative_names: ["zg"]
     path:  ${internal.paths.hr.zg}
     apply_standardize:  true
     apply_normalize:  true
     invariant:  false
     transform: ["x * 69 + 420"]
  2. Register and alias it in injections.yaml

    defaults:
      - climate_models/hr/zg@internal._hr_zg
    internal:
      hr_zg:  ${internal._hr_zg}
  3. Add it to the model’s variable list
    In conf/climate_models/hr.yaml:

     climate_variables:
         -  ${internal.hr_zg}  # other variables...

That’s it — your new model or variable will now be included in the pipeline when preprocess.py is run.

🚀 Running

  1. Explore your data and ensure compatibility.
  2. Set up your configuration:
  • Edit conf/config.yaml to include the models you want to use under climate_models:
  • For each model, go to its model.yaml file and uncomment (or add) the variables you want included
  3. Run the nc2pt/preprocess.py script, which runs through your preprocessing steps and creates the Zarr files.
  4. Run the nc2pt/tools/zarr_to_torch.py script, which serializes each time step in the .zarr file to an individual PyTorch .pt file.
  5. Optional: run nc2pt/tools/single_file_to_batches.py, which combines individual files from the previous step into random batches. This setup reduces I/O in your machine learning pipeline.

Testing

Testing is done with pytest. The easiest way to run the tests is to install pytest and use the command: pytest --cov-report term-missing --cov=nc2pt .

This generates a coverage report and automatically collects files named test_*.py in nc2pt/tests.
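
A minimal example of the naming convention pytest picks up (the file and function below are hypothetical, shown only to illustrate the convention, and are not part of nc2pt):

```python
# nc2pt/tests/test_example.py  (hypothetical file name)

def add_offset(x: float, offset: float = 1.0) -> float:
    return x + offset

def test_add_offset():
    # pytest collects functions prefixed with test_ in files named test_*.py
    assert add_offset(2.0) == 3.0
```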

📝 Notes

  • Chunking Sensitivity:
    The preprocessing pipeline is sensitive to how datasets are chunked in memory. If you encounter memory errors or Dask worker crashes, reviewing and adjusting the chunk sizes is a good first step. See closed issue #18 for details and suggestions.

  • Interpolation Method:
    The current interpolation method uses xarray’s native 2D interpolation, which does not account for Earth curvature. This repository previously used an xESMF-backed interpolation scheme that performed regridding on spherical geometry. However, within the scope of this work, it was found that the difference in performance was negligible, so the dependency on xESMF was removed. See closed issue #15 for more context.

About

Serializing NetCDF files for efficient use in deep learning pipelines.