Commit ae2fca6

Merge pull request #40 from climagination/alignment_pipeline_refactor
Add customizable alignment pipeline per model via alignment_pipeline config field
2 parents 9afc17f + df7e7d5 commit ae2fca6

11 files changed: +420 -206 lines changed

README.md

Lines changed: 39 additions & 4 deletions
@@ -36,10 +36,37 @@ High-level workflow
 2. slices data to a pre-determined range of dates
 3. aligns the grids via interpolation, crops them to be the same size, and coarsens the low-resolution fields by the configured scale factor
 4. applies user defined transforms like unit conversions or log transformations
-5. splits into a train and test dataset and standardizes both datasets based on the mean and standard deviation of all grids from the training data only (also writes this information into the zarr metadata for inference)
-6. writes to `.zarr`
-7. `nc2pt/tools/zarr_to_torch.py` - writes to PyTorch files
-8. `nc2pt/tools/single_file_to_batches.py` - batches the single PyTorch files
+5. optionally splits into train/test/validation based on years defined in the config
+6. standardizes/normalizes the datasets based on statistics computed from the training data (or full data, if no split). These statistics are stored in the metadata.
+7. writes to `.zarr`
+8. `nc2pt/tools/zarr_to_torch.py` - writes to PyTorch files
+9. `nc2pt/tools/single_file_to_batches.py` - batches the single PyTorch files
+
+## Customizable Pipelines 🚦
+
+Each model can define its own custom preprocessing steps by listing them in order via the `alignment_pipeline` field of the model YAML. Steps include:
+
+- `temporal_crop`
+
+- `regrid`
+
+- `spatial_crop`
+
+- `coarsen`
+
+- `user_defined_transforms`
+
+- `data_split`
+
+
+By default, all 6 are applied. You can exclude or reorder them by editing the YAML, e.g.:
+
+```
+alignment_pipeline:
+- temporal_crop
+- regrid
+- spatial_crop
+```
 
 ## What are the downsides of using PyTorch files for climate data?
 The most obvious downside is that you lose the metadata associated with a netCDF dataset. The intermediate Zarr format produced by nc2pt allows for parallelized io and perserves the metadata. This is useful for inference.
@@ -107,6 +134,13 @@ To add a new model:
 _target_: nc2pt.climatedata.ClimateModel
 name: my_model
 info: "My custom climate model"
+alignment_pipeline:
+- temporal_crop
+- regrid
+- spatial_crop
+- coarsen
+- user_defined_transforms
+- data_split
 climate_variables:
 - ${internal.my_model_pr}
 - ${internal.my_model_tas}
@@ -154,6 +188,7 @@ To add a new variable to an existing model (e.g., `hr`):
 apply_standardize: true
 apply_normalize: true
 invariant: false
+transform: ["x * 69 + 420"]
 ```
 
 2. **Register and alias it in `injections.yaml`**
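
Reviewer note: the `alignment_pipeline` field added in this commit is an ordered list of step names, so one way to read it is as a lookup into a table of step functions that are applied in sequence. The sketch below illustrates that idea only. It is not taken from nc2pt: `STEP_REGISTRY`, `make_recorder`, and `run_alignment_pipeline` are hypothetical names, and the placeholder steps only record that they ran instead of touching any data. Only the six step names come from the README diff.

```python
from typing import Callable, Dict, List, Optional

# The six step names listed in the README diff, in the default order.
DEFAULT_PIPELINE: List[str] = [
    "temporal_crop",
    "regrid",
    "spatial_crop",
    "coarsen",
    "user_defined_transforms",
    "data_split",
]


def make_recorder(name: str) -> Callable[[dict], dict]:
    """Return a placeholder step that only records that it ran."""
    def step(state: dict) -> dict:
        # Build a new dict so earlier states are never mutated.
        return {**state, "applied": state.get("applied", []) + [name]}
    return step


# Hypothetical registry mapping step names to callables (an assumption,
# not nc2pt's real structure).
STEP_REGISTRY: Dict[str, Callable[[dict], dict]] = {
    name: make_recorder(name) for name in DEFAULT_PIPELINE
}


def run_alignment_pipeline(state: dict, steps: Optional[List[str]] = None) -> dict:
    """Apply the named steps in the order given; use all six if none are given."""
    names = DEFAULT_PIPELINE if steps is None else steps
    unknown = [n for n in names if n not in STEP_REGISTRY]
    if unknown:
        raise ValueError(f"Unknown pipeline steps: {unknown}")
    for name in names:
        state = STEP_REGISTRY[name](state)
    return state


# Mirrors the three-step YAML example from the README diff.
result = run_alignment_pipeline({}, ["temporal_crop", "regrid", "spatial_crop"])
print(result["applied"])  # ['temporal_crop', 'regrid', 'spatial_crop']
```

Rejecting unknown names before running anything means a typo in the YAML fails early instead of silently skipping a step. Whether nc2pt validates the list this way is not shown in the diff.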
