
Feat pytorch calibrator #190

Draft
JemmaLDaniel wants to merge 25 commits into main from feat-pytorch-calibrator

Conversation

@JemmaLDaniel
Collaborator

Summary

Replaces the scikit-learn MLPClassifier-based calibrator with a custom PyTorch neural network, enabling larger and more customisable models for pre-trained calibration of InstaNovo predictions. Introduces a two-phase training pipeline (compute features to disk, then train from Parquet) to support training on large-scale datasets (20M+ spectra across 50+ projects) without loading everything into RAM.

Key changes

PyTorch calibrator & training pipeline

  • PyTorch calibrator: ProbabilityCalibrator now wraps a CalibratorNetwork (nn.Module) with configurable hidden dims, dropout, and training hyperparameters. Automatic GPU detection with CPU fallback.
  • safetensors serialisation: Models are saved as model.safetensors + config.json (architecture, hyperparameters, resolved feature configs via get_config()). Pickle and sklearn model support is dropped.
  • Training history: TrainingHistory dataclass records epoch-level train/val losses and accuracies, with JSON persistence and plotting.
  • Two-phase training pipeline: winnow train supports pre-computed features via features_path/val_features_path (single file or directory of Parquets), or a single-phase flow with automatic validation_fraction split (with data leakage warning).
  • FeatureDataset: PyTorch Dataset wrapper that loads from numpy arrays or Parquet files. from_parquet() supports directories for multi-project concatenation.
  • resolve_data_path: Resolves local paths or downloads from HuggingFace Hub, enabling HF-hosted datasets and models.
  • OmegaConf compatibility: get_config() converts DictConfig/ListConfig to plain Python types before JSON serialisation.
  • Progress bar for training: Calibrator training epochs are logged via tqdm progress bar.
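
The configurable network described above could look roughly like the following. This is a hypothetical sketch, not the actual `CalibratorNetwork` implementation: the constructor signature, defaults, and the single-logit output head are assumptions for illustration.

```python
# Hypothetical sketch of a configurable MLP calibrator head (names and
# defaults assumed; not the real CalibratorNetwork code).
import torch
from torch import nn


class CalibratorNetwork(nn.Module):
    """Binary calibrator: feature vector -> logit for P(prediction correct)."""

    def __init__(self, input_dim: int, hidden_dims=(128, 64), dropout: float = 0.1):
        super().__init__()
        layers: list[nn.Module] = []
        prev = input_dim
        for h in hidden_dims:
            layers += [nn.Linear(prev, h), nn.ReLU(), nn.Dropout(dropout)]
            prev = h
        layers.append(nn.Linear(prev, 1))  # single logit; pair with BCEWithLogitsLoss
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)


# Automatic GPU detection with CPU fallback, as described above.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = CalibratorNetwork(input_dim=10).to(device)
probs = torch.sigmoid(model(torch.randn(4, 10, device=device)))
```

Keeping the network's output as a raw logit (and applying `sigmoid` only at inference time) is the usual choice, since `BCEWithLogitsLoss` is numerically more stable than `BCELoss` on probabilities.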

Per-experiment iRT regression (merged from fix-per-experiment-irt-regression)

The following changes were developed on the fix-per-experiment-irt-regression branch and merged into this branch:

  • Per-experiment iRT regression: RetentionTimeFeature now trains a LinearRegression per experiment_name (replacing the single global MLPRegressor), with configurable min_train_points threshold.
  • iRT regressor serialisation: RetentionTimeFeature regressors saved/loaded via safetensors (keyed by experiment name) instead of pickle, with save_regressors()/load_regressors() API and CLI support via irt_regressor_output_path/irt_regressor_path.
  • Batched Koina calls: All per-experiment Koina iRT calls are batched into a single request for performance.
  • iRT metadata columns: Standardised to lowercase snake case.
  • Prosit → Koina rename: All user-facing references to "Prosit" renamed to "Koina" (e.g. prosit_intensity_model_name → intensity_model_name, check_valid_chimeric_prosit_prediction → check_valid_chimeric_prediction).
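
The per-experiment regression scheme above can be sketched as follows. Names (`fit_irt_regressors`, `min_train_points`) mirror the description but the function itself is illustrative, not the actual `RetentionTimeFeature` code.

```python
# Illustrative sketch of per-experiment iRT regression: one LinearRegression
# per experiment_name, skipping experiments with too few training points
# (replacing a single global regressor). Function name is hypothetical.
import numpy as np
from sklearn.linear_model import LinearRegression


def fit_irt_regressors(rt, irt, experiment_names, min_train_points=3):
    """Return {experiment_name: fitted LinearRegression} for experiments
    with at least min_train_points observations."""
    rt = np.asarray(rt, dtype=float)
    irt = np.asarray(irt, dtype=float)
    names = np.asarray(experiment_names)
    regressors = {}
    for name in np.unique(names):
        mask = names == name
        if mask.sum() < min_train_points:
            continue  # too few points: skip rather than fit a noisy model
        regressors[name] = LinearRegression().fit(rt[mask, None], irt[mask])
    return regressors


rts = [1.0, 2.0, 3.0, 10.0]
irts = [2.0, 4.0, 6.0, 20.0]
exps = ["A", "A", "A", "B"]  # "B" has only one point and will be skipped
models = fit_irt_regressors(rts, irts, exps, min_train_points=3)
```

A linear fit per experiment is cheap enough that the `min_train_points` guard matters mostly for robustness: with one or two points the regression is exactly determined (or underdetermined) and generalises poorly.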

Prediction column remapping (merged from feat-prediction-column-remapping-for-instanovo-backwards-compatibility)

The following changes were developed on a separate branch and merged into this branch:

  • Configurable prediction column names: InstaNovoDatasetLoader accepts a column_mapping dict for backwards compatibility with older InstaNovo versions that use different CSV column headers.
  • Optional residue_remapping: InstaNovoDatasetLoader.residue_remapping is now optional (defaults to None).

Data pipeline & build

  • Experiment-aware data loading: compute-features supports folders of per-experiment files, with experiment_name detection from DataFrame columns or file basenames.
  • Dual compute-features output: metadata_output_path (full CSV for EDA) and training_matrix_output_path (lean numeric Parquet for training).
  • New dependencies: torch, safetensors, and polars added as direct dependencies in pyproject.toml.
  • New winnow.utils package: Added with paths.py (resolve_data_path).
  • Makefile: Added docs, docs-serve, clean-docs, and check-build targets.
  • Documentation: Updated docs/api/calibration.md, docs/cli.md, and docs/configuration.md to reflect all changes.

Files changed (24 files, +2652/−761)

| Area | Files |
| --- | --- |
| Core | winnow/calibration/calibrator.py, winnow/calibration/calibration_features.py |
| New modules | winnow/datasets/feature_dataset.py, winnow/utils/paths.py |
| Data loaders | winnow/datasets/data_loaders.py |
| CLI | winnow/scripts/main.py |
| Configs | calibrator.yaml, train.yaml, predict.yaml, compute_features.yaml, data_loader/instanovo.yaml |
| Docs | docs/api/calibration.md, docs/cli.md, docs/configuration.md |
| Tests | test_calibrator.py, test_calibration_features.py, test_data_loaders.py, test_feature_dataset.py, test_paths.py |
| Build | pyproject.toml, requirements.txt, uv.lock, Makefile, .gitignore |

Test plan

  • test_end_to_end_fit_predict — full pipeline: add features → compute → fit → predict
  • test_save_load_roundtrip — save, load, predict on new data (verifies importlib feature reconstruction)
  • test_save_load_weights_and_normalization_match — weights and feature_mean/std survive roundtrip exactly
  • test_early_stopping_triggers — patience exhausted with inverted val labels
  • test_get_config_converts_omegaconf_to_plain — DictConfig/ListConfig → plain Python, JSON-serialisable
  • test_fit_from_features_returns_history / test_fit_from_features_with_validation — two-phase training path
  • test_save_and_load_regressors_safetensors — iRT regressor roundtrip via safetensors
  • test_prepare_per_experiment / test_prepare_skips_preloaded_experiments — per-experiment iRT regression
  • test_hf_download_fallback / test_hf_download_failure_raises_with_context — mocked HF download path
  • test_dataloader_integration — FeatureDataset works with PyTorch DataLoader
  • test_from_parquet_directory_preserves_values — multi-file Parquet loading preserves values
  • test_default_column_mapping / test_custom_column_mapping_* / test_process_predictions_with_custom_column_mapping — prediction column remapping


fix: convert ProbabilityCalibrator OmegaConf objects into plain Python types
@JemmaLDaniel JemmaLDaniel self-assigned this Apr 14, 2026
@JemmaLDaniel JemmaLDaniel added the `enhancement` (New feature or request) label Apr 14, 2026
@github-actions

Coverage Report
| File | Stmts | Miss | Cover | Missing |
| --- | --- | --- | --- | --- |
| `__init__.py` | 0 | 0 | 100% | |
| `data_types.py` | 4 | 0 | 100% | |
| `calibration/__init__.py` | 0 | 0 | 100% | |
| `calibration/calibration_features.py` | 387 | 16 | 95% | 203–204, 285–286, 483, 770, 954, 958, 1292, 1308, 1316, 1417–1420, 1423 |
| `calibration/calibrator.py` | 293 | 9 | 96% | 105, 189–190, 192, 220–221, 445–446, 449 |
| `compat/__init__.py` | 0 | 0 | 100% | |
| `compat/instanovo.py` | 10 | 6 | 40% | 12, 14–15, 17, 24–25 |
| `datasets/__init__.py` | 0 | 0 | 100% | |
| `datasets/calibration_dataset.py` | 109 | 17 | 84% | 155, 169, 171, 173, 183, 196, 249, 251–252, 258–261, 263–266 |
| `datasets/data_loaders.py` | 276 | 13 | 95% | 23, 205, 236–237, 434, 876, 880, 929, 940, 1054–1055, 1091–1092 |
| `datasets/feature_dataset.py` | 30 | 0 | 100% | |
| `datasets/interfaces.py` | 3 | 0 | 100% | |
| `datasets/psm_dataset.py` | 25 | 0 | 100% | |
| `fdr/__init__.py` | 0 | 0 | 100% | |
| `fdr/base.py` | 58 | 15 | 74% | 81, 85–86, 91, 98–99, 105, 126, 129–130, 135, 137–138, 144, 186 |
| `fdr/database_grounded.py` | 28 | 1 | 96% | 52 |
| `fdr/nonparametric.py` | 25 | 4 | 84% | 62, 68–69, 72 |
| `scripts/__init__.py` | 0 | 0 | 100% | |
| `scripts/main.py` | 285 | 285 | 0% | 8, 10–14, 16–17, 20–23, 26–27, 29–31, 35, 42, 47, 50, 56, 58–59, 62, 71, 79, 82, 89, 91–93, 95, 97–102, 105, 107–108, 113, 128, 131, 138–144, 147–148, 151, 177–179, 181, 183, 188, 190–192, 194, 196–197, 199–200, 202–204, 209, 211, 214, 218–221, 223–224, 226, 228, 231, 241–242, 244–246, 248–253, 255, 258, 269–270, 272–274, 276–277, 279–281, 283–284, 287, 298–299, 301, 303–304, 306, 314–316, 318–321, 323, 327, 332, 335, 343–345, 347–350, 352–355, 358, 370–374, 378, 381, 390–391, 395–398, 402–405, 409–411, 414, 417, 426–427, 431, 433–434, 439–441, 443–444, 447, 456–460, 462–465, 471, 491–493, 495, 497, 502, 504–506, 508–509, 511–513, 515–517, 519–520, 529, 538–539, 541, 543, 545, 548–549, 551–553, 560, 563, 577–579, 582, 585, 590, 592–594, 596–598, 600–601, 604–605, 608, 610–611, 613, 615, 617–618, 620, 623–624, 630–632, 634–637, 640–641, 644–645, 648–649, 652–653, 661–663, 667, 670, 674, 677, 700, 713–714, 717, 739, 751–752, 755, 780, 793–794, 797, 812, 824–825, 828, 840, 852–853, 856, 871, 883–884 |
| `utils/__init__.py` | 4 | 0 | 100% | |
| `utils/config_formatter.py` | 53 | 40 | 24% | 29, 37–38, 40–42, 44, 55, 58–60, 62–63, 66–69, 72–74, 77–78, 80, 91, 102, 113, 127–128, 130–132, 145–147, 150, 153–154, 157–158, 160 |
| `utils/config_path.py` | 76 | 5 | 93% | 24–26, 117–118 |
| `utils/paths.py` | 20 | 0 | 100% | |
| `utils/peptide.py` | 16 | 0 | 100% | |
| **TOTAL** | 1702 | 411 | 75% | |

| Tests | Skipped | Failures | Errors | Time |
| --- | --- | --- | --- | --- |
| 335 | 0 💤 | 0 ❌ | 0 🔥 | 34.317s ⏱️ |

@JemmaLDaniel JemmaLDaniel changed the base branch from main to fix-per-experiment-irt-regression April 14, 2026 10:48
@JemmaLDaniel JemmaLDaniel changed the base branch from fix-per-experiment-irt-regression to main April 14, 2026 10:48