Feat pytorch calibrator by JemmaLDaniel · Pull Request #190 · instadeepai/winnow

JemmaLDaniel · 2026-04-14T08:27:29Z

Summary

Replaces the scikit-learn MLPClassifier-based calibrator with a custom PyTorch neural network, enabling larger and more customisable models for pre-trained calibration of InstaNovo predictions. Introduces a two-phase training pipeline (compute features to disk, then train from Parquet) to support training on large-scale datasets (20M+ spectra across 50+ projects) without loading everything into RAM.

Key changes

PyTorch calibrator & training pipeline

PyTorch calibrator: ProbabilityCalibrator now wraps a CalibratorNetwork (nn.Module) with configurable hidden dims, dropout, and training hyperparameters. Automatic GPU detection with CPU fallback.
safetensors serialisation: Models are saved as model.safetensors + config.json (architecture, hyperparameters, resolved feature configs via get_config()). Pickle and sklearn model support is dropped.
Training history: TrainingHistory dataclass records epoch-level train/val losses and accuracies, with JSON persistence and plotting.
Two-phase training pipeline: winnow train supports pre-computed features via features_path/val_features_path (each a single Parquet file or a directory of Parquet files), or a single-phase flow that splits off a validation set automatically using validation_fraction (with a data-leakage warning).
FeatureDataset: PyTorch Dataset wrapper that loads from numpy arrays or Parquet files. from_parquet() supports directories for multi-project concatenation.
resolve_data_path: Resolves local paths or downloads from HuggingFace Hub, enabling HF-hosted datasets and models.
OmegaConf compatibility: get_config() converts DictConfig/ListConfig to plain Python types before JSON serialisation.
Progress bar for training: Calibrator training epochs are reported via a tqdm progress bar.
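
As an illustration of the epoch-level history with JSON persistence described above, here is a minimal standard-library sketch. The field names and the record()/save()/load() methods are assumptions made for this example and may not match the actual TrainingHistory in winnow; plotting is left out.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import List


@dataclass
class TrainingHistory:
    """Illustrative stand-in; field and method names are assumptions."""

    train_losses: List[float] = field(default_factory=list)
    val_losses: List[float] = field(default_factory=list)
    train_accuracies: List[float] = field(default_factory=list)
    val_accuracies: List[float] = field(default_factory=list)

    def record(self, train_loss: float, val_loss: float,
               train_acc: float, val_acc: float) -> None:
        """Append the metrics for one finished epoch."""
        self.train_losses.append(train_loss)
        self.val_losses.append(val_loss)
        self.train_accuracies.append(train_acc)
        self.val_accuracies.append(val_acc)

    def save(self, path: Path) -> None:
        """Write the history as plain JSON."""
        Path(path).write_text(json.dumps(asdict(self), indent=2))

    @classmethod
    def load(cls, path: Path) -> "TrainingHistory":
        """Rebuild a history from a JSON file written by save()."""
        return cls(**json.loads(Path(path).read_text()))


history = TrainingHistory()
history.record(0.69, 0.70, 0.55, 0.54)  # made-up numbers for epoch 1
history.record(0.52, 0.58, 0.74, 0.70)  # made-up numbers for epoch 2

# Round-trip through a temporary file to show the persistence step.
with tempfile.TemporaryDirectory() as tmp:
    history_path = Path(tmp) / "training_history.json"
    history.save(history_path)
    restored = TrainingHistory.load(history_path)
```

Keeping the history as plain lists of floats means it serialises to JSON without custom encoders, which is the same motivation the PR gives for converting OmegaConf objects to plain Python types in get_config().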

Per-experiment iRT regression (merged from `fix-per-experiment-irt-regression`)

The following changes were developed on the fix-per-experiment-irt-regression branch and merged into this branch:

Per-experiment iRT regression: RetentionTimeFeature now trains a LinearRegression per experiment_name (replacing the single global MLPRegressor), with configurable min_train_points threshold.
iRT regressor serialisation: RetentionTimeFeature regressors saved/loaded via safetensors (keyed by experiment name) instead of pickle, with save_regressors()/load_regressors() API and CLI support via irt_regressor_output_path/irt_regressor_path.
Batched Koina calls: All per-experiment Koina iRT calls are batched into a single request for performance.
iRT metadata columns: Standardised to lowercase snake case.
Prosit → Koina rename: All user-facing references to "Prosit" renamed to "Koina" (e.g. prosit_intensity_model_name → intensity_model_name, check_valid_chimeric_prosit_prediction → check_valid_chimeric_prediction).
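
The per-experiment fitting with a minimum-points threshold can be sketched as follows. This is a dependency-free illustration, not the RetentionTimeFeature implementation: it uses a hand-written least-squares fit in place of scikit-learn's LinearRegression, the helper names are hypothetical, and the regression direction (iRT as a function of observed retention time) is an assumption.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_per_experiment(rows, min_train_points=2):
    """Fit one line per experiment_name, skipping sparse experiments.

    rows: iterable of (experiment_name, retention_time, irt) tuples.
    Returns {experiment_name: (slope, intercept)}.
    """
    grouped = defaultdict(list)
    for experiment_name, retention_time, irt in rows:
        grouped[experiment_name].append((retention_time, irt))

    regressors = {}
    for name, points in grouped.items():
        if len(points) < min_train_points:
            continue  # too few points to trust a fit for this experiment
        xs, ys = zip(*points)
        regressors[name] = fit_line(xs, ys)
    return regressors


# Made-up data: exp_a has three points, exp_b only one.
rows = [
    ("exp_a", 10.0, 1.0),
    ("exp_a", 20.0, 3.0),
    ("exp_a", 30.0, 5.0),
    ("exp_b", 5.0, 0.5),
]
regressors = fit_per_experiment(rows, min_train_points=3)
```

With min_train_points=3, only exp_a receives a regressor; how the real feature handles experiments below the threshold is not stated in this description.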

Prediction column remapping (merged from `feat-prediction-column-remapping-for-instanovo-backwards-compatibility`)

The following changes were developed on the feat-prediction-column-remapping-for-instanovo-backwards-compatibility branch and merged into this branch:

Configurable prediction column names: InstaNovoDatasetLoader accepts a column_mapping dict for backwards compatibility with older InstaNovo versions that use different CSV column headers.
Optional residue_remapping: InstaNovoDatasetLoader.residue_remapping is now optional (defaults to None).
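
The column_mapping idea can be sketched with the standard library. This is a minimal sketch, not InstaNovoDatasetLoader itself: the read_predictions helper, the direction of the mapping (canonical name to CSV header), and the header names in the sample CSV are all assumptions for illustration.

```python
import csv
import io

# Canonical names the rest of the pipeline is assumed to expect.
DEFAULT_COLUMNS = {"predictions": "predictions", "log_probs": "log_probs"}


def read_predictions(csv_text, column_mapping=None):
    """Read prediction rows, translating file headers to canonical names.

    column_mapping maps canonical name -> header used in the CSV and
    overrides the defaults only for the keys it contains.
    """
    columns = {**DEFAULT_COLUMNS, **(column_mapping or {})}
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {canonical: row[header] for canonical, header in columns.items()}
        for row in reader
    ]


# Hypothetical older-format file that calls the predictions column "preds".
old_csv = "preds,log_probs\nPEPTIDE,-0.12\n"
rows = read_predictions(old_csv, column_mapping={"predictions": "preds"})
```

Because the mapping is merged over the defaults, a caller only needs to list the headers that differ from the current format.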

Data pipeline & build

Experiment-aware data loading: compute-features supports folders of per-experiment files, with experiment_name detection from DataFrame columns or file basenames.
Dual compute-features output: metadata_output_path (full CSV for EDA) and training_matrix_output_path (lean numeric Parquet for training).
New dependencies: torch, safetensors, and polars added as direct dependencies in pyproject.toml.
New winnow.utils package: Added with paths.py (resolve_data_path).
Makefile: Added docs, docs-serve, clean-docs, and check-build targets.
Documentation: Updated docs/api/calibration.md, docs/cli.md, and docs/configuration.md to reflect all changes.
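
The experiment_name detection mentioned above (from a column, or from the file basename) might look roughly like this. The helper name and the precedence (column value first, file stem as fallback) are assumptions; the description does not say which source wins when both are available.

```python
from pathlib import Path


def experiment_name_for(row, source_path):
    """Pick an experiment name for one record.

    row: mapping of column name -> value for a single record.
    source_path: the file the record was read from.
    """
    name = row.get("experiment_name")
    # Fall back to the file basename without its extension.
    return name if name else Path(source_path).stem


from_column = experiment_name_for({"experiment_name": "run_01"}, "data/batch.csv")
from_file = experiment_name_for({}, "data/sample_A.parquet")
```

Here from_column takes the value carried in the record, while from_file falls back to the stem of the hypothetical path data/sample_A.parquet.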

Files changed (24 files, +2652/−761)

Area	Files
Core	`winnow/calibration/calibrator.py`, `winnow/calibration/calibration_features.py`
New modules	`winnow/datasets/feature_dataset.py`, `winnow/utils/paths.py`
Data loaders	`winnow/datasets/data_loaders.py`
CLI	`winnow/scripts/main.py`
Configs	`calibrator.yaml`, `train.yaml`, `predict.yaml`, `compute_features.yaml`, `data_loader/instanovo.yaml`
Docs	`docs/api/calibration.md`, `docs/cli.md`, `docs/configuration.md`
Tests	`test_calibrator.py`, `test_calibration_features.py`, `test_data_loaders.py`, `test_feature_dataset.py`, `test_paths.py`
Build	`pyproject.toml`, `requirements.txt`, `uv.lock`, `Makefile`, `.gitignore`

Test plan

github-actions · 2026-04-14T08:29:31Z

Coverage Report

File	Stmts	Miss	Cover	Missing
__init__.py	0	0	100%
data_types.py	4	0	100%
calibration
__init__.py	0	0	100%
calibration_features.py	387	16	95%	203–204, 285–286, 483, 770, 954, 958, 1292, 1308, 1316, 1417–1420, 1423
calibrator.py	293	9	96%	105, 189–190, 192, 220–221, 445–446, 449
compat
__init__.py	0	0	100%
instanovo.py	10	6	40%	12, 14–15, 17, 24–25
datasets
__init__.py	0	0	100%
calibration_dataset.py	109	17	84%	155, 169, 171, 173, 183, 196, 249, 251–252, 258–261, 263–266
data_loaders.py	276	13	95%	23, 205, 236–237, 434, 876, 880, 929, 940, 1054–1055, 1091–1092
feature_dataset.py	30	0	100%
interfaces.py	3	0	100%
psm_dataset.py	25	0	100%
fdr
__init__.py	0	0	100%
base.py	58	15	74%	81, 85–86, 91, 98–99, 105, 126, 129–130, 135, 137–138, 144, 186
database_grounded.py	28	1	96%	52
nonparametric.py	25	4	84%	62, 68–69, 72
scripts
__init__.py	0	0	100%
main.py	285	285	0%	8, 10–14, 16–17, 20–23, 26–27, 29–31, 35, 42, 47, 50, 56, 58–59, 62, 71, 79, 82, 89, 91–93, 95, 97–102, 105, 107–108, 113, 128, 131, 138–144, 147–148, 151, 177–179, 181, 183, 188, 190–192, 194, 196–197, 199–200, 202–204, 209, 211, 214, 218–221, 223–224, 226, 228, 231, 241–242, 244–246, 248–253, 255, 258, 269–270, 272–274, 276–277, 279–281, 283–284, 287, 298–299, 301, 303–304, 306, 314–316, 318–321, 323, 327, 332, 335, 343–345, 347–350, 352–355, 358, 370–374, 378, 381, 390–391, 395–398, 402–405, 409–411, 414, 417, 426–427, 431, 433–434, 439–441, 443–444, 447, 456–460, 462–465, 471, 491–493, 495, 497, 502, 504–506, 508–509, 511–513, 515–517, 519–520, 529, 538–539, 541, 543, 545, 548–549, 551–553, 560, 563, 577–579, 582, 585, 590, 592–594, 596–598, 600–601, 604–605, 608, 610–611, 613, 615, 617–618, 620, 623–624, 630–632, 634–637, 640–641, 644–645, 648–649, 652–653, 661–663, 667, 670, 674, 677, 700, 713–714, 717, 739, 751–752, 755, 780, 793–794, 797, 812, 824–825, 828, 840, 852–853, 856, 871, 883–884
utils
__init__.py	4	0	100%
config_formatter.py	53	40	24%	29, 37–38, 40–42, 44, 55, 58–60, 62–63, 66–69, 72–74, 77–78, 80, 91, 102, 113, 127–128, 130–132, 145–147, 150, 153–154, 157–158, 160
config_path.py	76	5	93%	24–26, 117–118
paths.py	20	0	100%
peptide.py	16	0	100%
TOTAL	1702	411	75%

Tests	Skipped	Failures	Errors	Time
335	0 💤	0 ❌	0 🔥	34.317s ⏱️

JemmaLDaniel added 25 commits April 9, 2026 16:36

feat: train per-experiment RT-iRT regressors

b605b23

test: update tests for per-experiment RT-iRT regressors

9004340

docs: update docs with new explanation of RetentionTimeFeature computation and config arguments

b839656

perf: batch all per-experiment Koina calls into one

91f8112

Merge branch 'fix-per-experiment-irt-regression' into feat-pytorch-calibrator

01e803e

feat: experiment-aware data loading for compute-features

3bd0e8e

feat: add option to compute feature matrix in compute-features

f92d76f

feat: add FeatureDataset, resolve_data_path and dependencies for torch models

26b3ebd

feat: rewrite ProbabilityCalibrator with PyTorch and safetensors, adding training history recording; fix: convert ProbabilityCalibrator OmegaConf objects into plain Python types

9a8f364

feat: two-phase training pipeline with optional validation splitting

bd3d737

chore: update configs for PyTorch calibrator and two-phase training

79b2fc1

test: update and extend tests

ad071d3

chore: ignore saved safetensors

b0501c5

docs: update ProbabilityCalibrator and RetentionTimeFeature docs

551625a

chore: add docs and package build Make commands

b1083bd

feat: configurable predictions and log-probs column names for InstaNovo backwards compatibility

29e156b

test: add test for prediction and log-prob column remapping

c73cf70

Merge branch 'feat-prediction-column-remapping-for-instanovo-backwards-compatibility' into feat-pytorch-calibrator

0576f6b

fix: make InstaNovoDatasetLoader residue_remapping optional

de6f79f

chore: use progress bar to log calibrator training progress

b704181

chore: replace mentions of 'Prosit' with 'Koina'

079f072

chore: standardise iRT metadata columns to lowercase snake case

967c825

feat: modify fit() to accept a CalibrationDataset directly and strip formatting from TrainingHistory.plot()

f3c4ca1

docs: update calibrator training documentation

868342f

test: update calibrator training tests

bb6bf62

JemmaLDaniel self-assigned this Apr 14, 2026

JemmaLDaniel added the enhancement New feature or request label Apr 14, 2026

JemmaLDaniel mentioned this pull request Apr 14, 2026

feat: record training history #181

Closed

JemmaLDaniel changed the base branch from main to fix-per-experiment-irt-regression April 14, 2026 10:48

JemmaLDaniel changed the base branch from fix-per-experiment-irt-regression to main April 14, 2026 10:48

JemmaLDaniel wants to merge 25 commits into main from feat-pytorch-calibrator
