Fix per-experiment iRT regression #188
Open

JemmaLDaniel wants to merge 4 commits into `main`
# Per-experiment RT-to-iRT linear regression
## Motivation
The existing `RetentionTimeFeature` uses a single `MLPRegressor` to map observed retention times (RT) to indexed retention times (iRT) across the entire dataset. This is problematic for two reasons:

1. **The RT-to-iRT mapping is experiment-specific.** Different LC-MS experiments have different chromatographic conditions (column, gradient, temperature, etc.), so a single global regressor conflates distinct linear relationships. When training or predicting on multi-experiment data, the MLP fits an average mapping that is suboptimal for every individual experiment.
2. **The relationship is linear.** RT and iRT are related by a simple affine transform within a single experiment. An MLP is unnecessarily complex for this: it introduces seven hyperparameters (`hidden_dim`, `learning_rate_init`, `alpha`, `max_iter`, `early_stopping`, `validation_fraction`, `seed`), risks overfitting on small experiments, and adds nondeterminism from stochastic optimisation. A `LinearRegression` is the natural fit.

A secondary issue is that the regressor was only fitted during training. While the trained MLP was pickled with the calibrator model and therefore available at inference time, it carried the training data's RT-to-iRT mapping, which will frequently be wrong for inference data from entirely different experiments with different chromatographic conditions. The regressor should always be re-fitted from the current data. Because of this, the `compute-features` entrypoint explicitly blocked unlabelled data with `RetentionTimeFeature` via an `if labelled` gate.

## Changes
### Per-experiment linear regressors (`calibration_features.py`)

- Replaced the single `MLPRegressor` with a `Dict[str, LinearRegression]` (`irt_predictors`), keyed by experiment name.
- `prepare()` groups spectra by `experiment_name`, selects the top `train_fraction` by confidence per experiment, batches all Koina iRT calls into a single request, then fits one `LinearRegression` per experiment.
- When `experiment_name` is absent, falls back to a single `__global__` regressor with a warning.
- `compute()` applies the correct per-experiment regressor when predicting iRT from observed RT.
- New `_select_training_data()` helper with a `min_train_points` guard that raises early with a clear error message if an experiment has too few valid spectra.
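The per-experiment fitting scheme can be sketched as below. This is a minimal illustration, not the PR's implementation: the column names (`rt`, `irt`, `confidence`), the function name, and the default values are assumptions; only `irt_predictors`, `experiment_name`, `train_fraction`, `min_train_points`, and the `__global__` fallback key come from the PR, and the real code batches Koina iRT calls rather than reading an `irt` column.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

GLOBAL_KEY = "__global__"  # fallback key when experiment_name is missing (as in the PR)

def fit_irt_predictors(
    df: pd.DataFrame,
    train_fraction: float = 0.5,
    min_train_points: int = 10,
) -> dict[str, LinearRegression]:
    """Fit one RT-to-iRT LinearRegression per experiment (illustrative sketch)."""
    if "experiment_name" not in df.columns:
        # single global regressor when no experiment grouping is available
        df = df.assign(experiment_name=GLOBAL_KEY)
    predictors: dict[str, LinearRegression] = {}
    for name, group in df.groupby("experiment_name"):
        # keep the top train_fraction most confident spectra of this experiment
        top = group.nlargest(max(1, int(len(group) * train_fraction)), "confidence")
        if len(top) < min_train_points:
            raise ValueError(f"experiment {name!r}: only {len(top)} valid training spectra")
        reg = LinearRegression()
        reg.fit(top[["rt"]].to_numpy(), top["irt"].to_numpy())
        predictors[name] = reg
    return predictors
```

Because each experiment's RT-to-iRT relation is affine, a two-parameter fit per group recovers it exactly where a single global model could not.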
### Always call `prepare()` (`calibrator.py`)

- Removed the `if labelled` gate on `feature.prepare()`. The RT regressor is self-supervised (it uses high-confidence de novo predictions, not database labels), so it must run at both training and inference time.
- Removed the `compute-features` entrypoint restriction that blocked `RetentionTimeFeature` on unlabelled data.
### Regressor checkpoint workflow (`main.py`, configs)

By default, the RT-to-iRT regressor is re-fitted from the current data at both training and inference time. This is the right behaviour for the general pretrained-model workflow, where inference data comes from unseen experiments. However, one important within-experiment use case requires carrying regressors forward: when a calibrator is trained on the subset of spectra that received database search labels and then applied to the remaining unlabelled de novo predictions from the same experiment(s). The unlabelled portion typically contains a greater proportion of lower-quality spectra (the ones the database search couldn't confidently match), so its de novo prediction confidence distribution is likely skewed towards lower-confidence predictions. Fitting the RT-to-iRT regressor from the top predictions in this skewed distribution produces a noisier fit than one derived from the higher-confidence labelled portion. By saving the regressors fitted during training and loading them at inference time, the calibrator uses the cleaner mapping from the labelled data for experiments it has already seen, while still fitting fresh regressors for any new experiments encountered at inference.
- New `save_regressors()` / `load_regressors()` methods on `RetentionTimeFeature` for persisting per-experiment regressors to a pickle file, separate from the calibrator model.
- `__getstate__` / `__setstate__` exclude transient regressor state from the calibrator pickle; regressors are always re-fitted from data unless explicitly loaded.
- `train.yaml`: new `irt_regressor_output_path` option to save regressors after training.
- `predict.yaml`: new `irt_regressor_path` option to load pre-fitted regressors at inference time.
- `main.py`: `train_entry_point` saves regressors when configured; `predict_entry_point` loads them when configured.

### `experiment_name` always available (`data_loaders.py`)

- `InstaNovoDatasetLoader` and `MZTabDatasetLoader` now derive `experiment_name` from the file stem when the column is not already present, even when `add_index_cols=False`. This ensures per-experiment grouping works without requiring explicit user configuration.
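The persistence behaviour described above (regressors excluded from the ordinary calibrator pickle, but saveable and loadable through explicit methods) can be sketched as follows. The class body here is a stand-in, not the actual `RetentionTimeFeature`; only the attribute name `irt_predictors` and the method names `save_regressors` / `load_regressors` come from the PR.

```python
from __future__ import annotations

import pickle
from pathlib import Path

from sklearn.linear_model import LinearRegression

class RetentionTimeFeatureSketch:
    """Illustrative sketch of the regressor-persistence behaviour in this PR."""

    def __init__(self) -> None:
        # per-experiment regressors, keyed by experiment name
        self.irt_predictors: dict[str, LinearRegression] = {}

    # --- pickling: drop transient regressor state from the calibrator pickle ---
    def __getstate__(self) -> dict:
        state = self.__dict__.copy()
        state["irt_predictors"] = {}  # always re-fit from data unless explicitly loaded
        return state

    def __setstate__(self, state: dict) -> None:
        self.__dict__.update(state)

    # --- explicit checkpointing, separate from the calibrator model ---
    def save_regressors(self, path: str | Path) -> None:
        with open(path, "wb") as f:
            pickle.dump(self.irt_predictors, f)

    def load_regressors(self, path: str | Path) -> None:
        with open(path, "rb") as f:
            self.irt_predictors = pickle.load(f)
```

With this split, `pickle.dumps(calibrator)` never carries a stale RT-to-iRT mapping, while a training run can still export its cleaner, label-derived regressors for reuse on the same experiments.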
### Configuration simplification (`calibrator.yaml`)

- Removed `hidden_dim`, `learning_rate_init`, `alpha`, `max_iter`, `early_stopping`, `validation_fraction` (and the MLP-specific `seed` description).
- Added `min_train_points` (default 10).
- `train_fraction` and `seed` retained with updated descriptions.
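After the simplification, the relevant `calibrator.yaml` block might look like this. The surrounding key name and the `train_fraction`/`seed` values are illustrative assumptions; only the three option names and the `min_train_points` default of 10 come from the PR.

```yaml
# hypothetical layout of the simplified RetentionTimeFeature config
retention_time_feature:
  train_fraction: 0.5   # top fraction of spectra, by confidence, used to fit each regressor (illustrative value)
  min_train_points: 10  # minimum valid spectra required per experiment (default from the PR)
  seed: 42              # illustrative value
```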
## Tests

- Updated the `RetentionTimeFeature` tests to use a pre-fitted `LinearRegression` instead of mocking `MLPRegressor.predict`.
- New tests: `test_pickle_excludes_regressor_state` and `test_save_and_load_regressors`.
- Updated `test_data_loaders.py` to expect `experiment_name` to always be present.
- Updated the `learn_from_missing` filtering tests for the new constructor signature.
## Documentation

- New `RetentionTimeFeature` section in `docs/api/calibration.md` covering per-experiment fitting, the `experiment_name` column, configuration parameters, and the regressor checkpoint workflow.
- Updated `docs/configuration.md` with the simplified config block and removed the `labelled` restriction note.