
Add Redundant Validation for Normalization Stats and Post-Inference Un-Normalization #39

@SBeairsto

Description


🚀 Issue: Add Redundant Validation for Normalization Stats and Post-Inference Un-Normalization

Summary:
To ensure consistent and traceable data transformations in the ML workflow, we will implement redundant storage and validation of normalization statistics (min/max). These statistics will be used to un-normalize the model output after inference, ensuring consistency with the training-time normalization.

The normalization.json file will also be consumed by a separate pre-processing pipeline responsible for preparing LR inference inputs. This issue covers only the un-normalization and validation logic.


📋 Tasks

Preprocessing Step (in nc2pt):

  • Save normalization statistics to a normalization.json file for each variable during training preprocessing

    • Fields: min, max, variable, method, created, etc.
  • Compute and include a hash (e.g. SHA256) of the JSON content to allow validation
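A minimal sketch of the preprocessing step above; the function name, file layout, and illustrative values are assumptions, but the fields (min, max, variable, method, created) and the SHA-256 hash follow the issue:

```python
import hashlib
import json
from datetime import datetime, timezone

def save_normalization_stats(path, variable, vmin, vmax, method="minmax"):
    """Write per-variable normalization stats plus a SHA-256 hash of the payload.

    The hash covers a canonical JSON encoding of the stats themselves, so any
    consumer can re-hash the payload and detect tampering or drift.
    """
    stats = {
        "variable": variable,
        "min": vmin,
        "max": vmax,
        "method": method,
        "created": datetime.now(timezone.utc).isoformat(),
    }
    payload = json.dumps(stats, sort_keys=True)  # canonical encoding for hashing
    stats["sha256"] = hashlib.sha256(payload.encode()).hexdigest()
    with open(path, "w") as f:
        json.dump(stats, f, indent=2)
    return stats

# Illustrative values for a hypothetical precipitation variable
stats = save_normalization_stats("normalization.json", "pr", 0.0, 187.4)
```

Sorting the keys before hashing makes the hash independent of dict insertion order, so the same stats always produce the same digest.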

Model Export (TorchScript):

  • Embed a copy of the normalization stats (and/or JSON hash) into the saved TorchScript model

    • Either via metadata dict or as attributes on a scripted module
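One way to embed the stats is PyTorch's `_extra_files` hook on `torch.jit.save`/`torch.jit.load`; the toy module, stats values, and file names below are illustrative. (The issue's alternative, storing the stats as attributes on the scripted module, would also work.)

```python
import json
import torch

class Scale(torch.nn.Module):
    """Stand-in for the real model; only the save/load mechanics matter here."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * 2.0

stats = {"variable": "pr", "min": 0.0, "max": 187.4, "method": "minmax"}

# Embed the stats JSON alongside the serialized graph.
scripted = torch.jit.script(Scale())
torch.jit.save(scripted, "model.pt",
               _extra_files={"normalization.json": json.dumps(stats)})

# At load time, request the same file back; torch fills in the dict values.
extra = {"normalization.json": ""}
loaded = torch.jit.load("model.pt", _extra_files=extra)
embedded = json.loads(extra["normalization.json"])
```

`torch.jit.load` may return the extra-file contents as bytes rather than str; `json.loads` accepts both.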

Inference Step:

  • Load the normalization.json file used for the variable

  • Load the normalization metadata from the TorchScript model

  • Validate that the loaded stats match those embedded in the model

    • Value check or hash comparison
  • Apply un-normalization to the model-generated output:

    output_real_scale = output * (max - min) + min
  • Save the un-normalized output as .zarr
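The inference-time checks and the un-normalization formula above can be sketched as follows; the function names and the choice of a value check (rather than a hash comparison) are assumptions, and writing the result to .zarr is left out to keep the sketch self-contained:

```python
def validate_stats(json_stats: dict, model_stats: dict) -> None:
    """Fail loudly if the on-disk stats disagree with those embedded in the model."""
    for key in ("variable", "min", "max", "method"):
        if json_stats.get(key) != model_stats.get(key):
            raise ValueError(
                f"Normalization mismatch on '{key}': "
                f"{json_stats.get(key)!r} != {model_stats.get(key)!r}"
            )

def denormalize(x: float, vmin: float, vmax: float) -> float:
    """Invert min/max scaling: output_real_scale = output * (max - min) + min."""
    return x * (vmax - vmin) + vmin

# Illustrative stats; in practice json_stats comes from normalization.json
# and model_stats from the TorchScript model's embedded metadata.
json_stats = {"variable": "pr", "min": 0.0, "max": 187.4, "method": "minmax"}
model_stats = dict(json_stats)
validate_stats(json_stats, model_stats)
real = denormalize(0.5, json_stats["min"], json_stats["max"])
```

Raising on mismatch (rather than warning) is deliberate: silently un-normalizing with the wrong stats would produce plausible-looking but wrong HR output.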

Utilities & Docs:

  • Add a Standardizer class or utility with .denormalize() and .validate_against_model() methods

  • Document expected normalization.json format and validation logic

  • Mention that standardization of LR input is handled in a separate preprocessing codebase
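A minimal sketch of the proposed Standardizer utility; only the method names `.denormalize()` and `.validate_against_model()` come from the issue, while the constructor, `from_json` helper, and the set of compared fields are assumptions:

```python
import json

class Standardizer:
    """Wraps the normalization stats for one variable."""

    def __init__(self, stats: dict):
        self.stats = stats

    @classmethod
    def from_json(cls, path: str) -> "Standardizer":
        """Load stats from a normalization.json file."""
        with open(path) as f:
            return cls(json.load(f))

    def validate_against_model(self, model_stats: dict) -> bool:
        """Value check: each stored field must match the model's embedded copy."""
        keys = ("variable", "min", "max", "method")
        if any(self.stats.get(k) != model_stats.get(k) for k in keys):
            raise ValueError("normalization stats do not match model metadata")
        return True

    def denormalize(self, x):
        """output_real_scale = output * (max - min) + min"""
        vmin, vmax = self.stats["min"], self.stats["max"]
        return x * (vmax - vmin) + vmin

# Illustrative usage with hypothetical stats
s = Standardizer({"variable": "pr", "min": 0.0, "max": 187.4, "method": "minmax"})
```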


🧠 Notes

  • Preprocessing of LR input (e.g., standardization and windowing) is handled externally in a separate codebase due to memory constraints.

  • This issue strictly handles:

    • Emitting reusable normalization metadata

    • Ensuring safe reuse during inference

    • Applying post-inference un-normalization for saving HR results

Metadata

Labels: enhancement (New feature or request)