README.md: 39 additions & 4 deletions
@@ -36,10 +36,37 @@ High-level workflow
 2. slices data to a pre-determined range of dates
 3. aligns the grids via interpolation, crops them to be the same size, and coarsens the low-resolution fields by the configured scale factor
 4. applies user defined transforms like unit conversions or log transformations
-5. splits into a train and test dataset and standardizes both datasets based on the mean and standard deviation of all grids from the training data only (also writes this information into the zarr metadata for inference)
-6. writes to `.zarr`
-7. `nc2pt/tools/zarr_to_torch.py` - writes to PyTorch files
-8. `nc2pt/tools/single_file_to_batches.py` - batches the single PyTorch files
+5. optionally splits into train/test/validation based on years defined in the config
+6. standardizes/normalizes the datasets based on statistics computed from the training data (or full data, if no split). These statistics are stored in the metadata.
+7. writes to `.zarr`
+8. `nc2pt/tools/zarr_to_torch.py` - writes to PyTorch files
+9. `nc2pt/tools/single_file_to_batches.py` - batches the single PyTorch files
+
+## Customizable Pipelines 🚦
+
+Each model can define its own custom preprocessing steps by listing them in order via the `alignment_pipeline` field of the model YAML. Steps include:
+
+- `temporal_crop`
+
+- `regrid`
+
+- `spatial_crop`
+
+- `coarsen`
+
+- `user_defined_transforms`
+
+- `data_split`
+
+
+By default, all 6 are applied. You can exclude or reorder them by editing the YAML, e.g.:
+
+```
+alignment_pipeline:
+- temporal_crop
+- regrid
+- spatial_crop
+```
 
 ## What are the downsides of using PyTorch files for climate data?
 The most obvious downside is that you lose the metadata associated with a netCDF dataset. The intermediate Zarr format produced by nc2pt allows for parallelized I/O and preserves the metadata. This is useful for inference.
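
Note on the hunk above: step 6 of the revised workflow standardizes the data with statistics computed from the training split and keeps those statistics in the metadata so they can be reused at inference time. The following is a minimal sketch of that idea, not nc2pt's actual implementation. The function names (`fit_standardizer`, `standardize`, `destandardize`) and the use of flat Python lists instead of gridded fields are assumptions made for illustration.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_values):
    """Compute mean and standard deviation from the training split only."""
    mean = fmean(train_values)
    std = pstdev(train_values, mu=mean)
    if std == 0:
        raise ValueError("training data has zero variance; cannot standardize")
    # A plain dict like this is the kind of record that could be kept as metadata.
    return {"mean": mean, "std": std}


def standardize(values, stats):
    """Apply the stored training statistics to any split."""
    return [(v - stats["mean"]) / stats["std"] for v in values]


def destandardize(values, stats):
    """Invert the transform, e.g. to return model output to physical units."""
    return [v * stats["std"] + stats["mean"] for v in values]


# Hypothetical values standing in for one variable's train and test splits.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [6.0, 7.0]

stats = fit_standardizer(train)
train_z = standardize(train, stats)
test_z = standardize(test, stats)   # test data reuses the training statistics
restored = destandardize(test_z, stats)
```

Fitting on the training split alone keeps information from the test and validation years out of the transform. Storing the resulting mean and standard deviation is what allows the same scaling to be undone later.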
@@ -107,6 +134,13 @@ To add a new model:
 _target_: nc2pt.climatedata.ClimateModel
 name: my_model
 info: "My custom climate model"
+alignment_pipeline:
+- temporal_crop
+- regrid
+- spatial_crop
+- coarsen
+- user_defined_transforms
+- data_split
 climate_variables:
 - ${internal.my_model_pr}
 - ${internal.my_model_tas}
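
Note on the hunk above: the `alignment_pipeline` list names preprocessing steps that are applied in the order given. One way such a list could be resolved and executed is sketched below. This is not taken from the nc2pt source. `resolve_pipeline`, `run_pipeline`, and the step registry are hypothetical names, and falling back to all six steps when the field is absent is an assumption based on the README's "By default, all 6 are applied".

```python
# The six step names listed in the README, in their default order.
DEFAULT_PIPELINE = [
    "temporal_crop",
    "regrid",
    "spatial_crop",
    "coarsen",
    "user_defined_transforms",
    "data_split",
]


def resolve_pipeline(model_config):
    """Return the ordered step names for a model config (a plain dict here)."""
    steps = model_config.get("alignment_pipeline")
    if steps is None:
        # Assumption: an omitted field means "run every step in default order".
        return list(DEFAULT_PIPELINE)
    unknown = [name for name in steps if name not in DEFAULT_PIPELINE]
    if unknown:
        raise ValueError(f"unknown pipeline steps: {unknown}")
    return list(steps)


def run_pipeline(data, steps, registry):
    """Apply each named step in order, feeding each output into the next step."""
    for name in steps:
        data = registry[name](data)
    return data


# Stand-in steps that only record their own name, so the ordering is visible.
registry = {
    name: (lambda data, name=name: data + [name]) for name in DEFAULT_PIPELINE
}

config = {
    "name": "my_model",
    "alignment_pipeline": ["temporal_crop", "regrid", "spatial_crop"],
}
trace = run_pipeline([], resolve_pipeline(config), registry)
print(trace)  # ['temporal_crop', 'regrid', 'spatial_crop']
```

Validating step names up front means a typo in the YAML fails immediately with a clear message rather than partway through a long preprocessing run.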
@@ -154,6 +188,7 @@ To add a new variable to an existing model (e.g., `hr`):