
Fix per experiment irt regression #188

Open
JemmaLDaniel wants to merge 4 commits into main from fix-per-experiment-irt-regression

Conversation

@JemmaLDaniel
Collaborator

Per-experiment RT-to-iRT linear regression

Motivation

The existing RetentionTimeFeature uses a single MLPRegressor to map observed retention times (RT) to indexed retention times (iRT) across the entire dataset. This is problematic for two reasons:

  1. The RT-to-iRT mapping is experiment-specific. Different LC-MS experiments have different chromatographic conditions (column, gradient, temperature, etc.), so a single global regressor conflates distinct linear relationships. When training or predicting on multi-experiment data, the MLP fits an average mapping that is suboptimal for every individual experiment.

  2. The relationship is linear. RT and iRT are related by a simple affine transform within a single experiment. An MLP is unnecessarily complex for this; it introduces 7 hyperparameters (hidden_dim, learning_rate_init, alpha, max_iter, early_stopping, validation_fraction, seed), risks overfitting on small experiments, and adds nondeterminism from stochastic optimisation. A LinearRegression is the natural fit.
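Because the within-experiment relationship is affine, ordinary least squares recovers it exactly and deterministically. A minimal sketch of the replacement fit (function name and data layout are illustrative, not the project's API):

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def fit_per_experiment(
    data: dict[str, tuple[np.ndarray, np.ndarray]],
) -> dict[str, LinearRegression]:
    """Fit one RT -> iRT affine map (slope and intercept) per experiment."""
    regressors: dict[str, LinearRegression] = {}
    for exp, (rt, irt) in data.items():
        reg = LinearRegression()
        reg.fit(rt.reshape(-1, 1), irt)  # irt ≈ a * rt + b within one experiment
        regressors[exp] = reg
    return regressors
```

Unlike the MLP, this has no optimiser hyperparameters at all: each experiment's mapping is fully determined by its own (RT, iRT) pairs.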

A secondary issue is that the regressor was only fitted during training. While the trained MLP was pickled with the calibrator model and therefore available at inference time, it carried the training data's RT-to-iRT mapping, which will frequently be wrong for inference data from entirely different experiments with different chromatographic conditions. The regressor should instead always be re-fitted from the current data. This training-only fitting is also why the compute-features entrypoint explicitly blocked unlabelled data with RetentionTimeFeature (the if labelled gate).

Changes

Per-experiment linear regressors (calibration_features.py)

  • Replace the single MLPRegressor with a Dict[str, LinearRegression] (irt_predictors), keyed by experiment name.
  • prepare() groups spectra by experiment_name, selects the top train_fraction by confidence per experiment, batches all Koina iRT calls into a single request, then fits one LinearRegression per experiment.
  • If experiment_name is absent, falls back to a single __global__ regressor with a warning.
  • Experiments that already have a fitted regressor (e.g. loaded from a checkpoint) are skipped.
  • compute() applies the correct per-experiment regressor when predicting iRT from observed RT.
  • New _select_training_data() helper with a min_train_points guard that raises early with a clear error message if an experiment has too few valid spectra.
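The bookkeeping described above can be sketched as follows. Names such as irt_predictors, min_train_points, and the __global__ fallback key come from the PR description; the class shape, data layout (parallel lists of experiment names, RTs, and iRTs), and the omission of the Koina call are simplifications, not the actual calibration_features.py code:

```python
from collections import defaultdict

from sklearn.linear_model import LinearRegression

GLOBAL_KEY = "__global__"  # fallback when experiment_name is absent


class PerExperimentIrt:
    """Simplified stand-in for the per-experiment regressor logic."""

    def __init__(self, min_train_points: int = 10):
        self.min_train_points = min_train_points
        self.irt_predictors: dict[str, LinearRegression] = {}

    def prepare(self, experiment_names, rts, irts) -> None:
        grouped = defaultdict(list)
        for name, rt, irt in zip(experiment_names, rts, irts):
            grouped[name or GLOBAL_KEY].append((rt, irt))
        for name, pairs in grouped.items():
            if name in self.irt_predictors:
                continue  # already fitted, e.g. loaded from a checkpoint
            if len(pairs) < self.min_train_points:
                raise ValueError(
                    f"Experiment {name!r} has only {len(pairs)} valid spectra; "
                    f"need at least {self.min_train_points} to fit the regressor."
                )
            xs = [[rt] for rt, _ in pairs]
            ys = [irt for _, irt in pairs]
            self.irt_predictors[name] = LinearRegression().fit(xs, ys)

    def compute(self, experiment_name, rt: float) -> float:
        """Predict iRT from observed RT with the experiment's own regressor."""
        reg = self.irt_predictors[experiment_name or GLOBAL_KEY]
        return float(reg.predict([[rt]])[0])
```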

Always call prepare() (calibrator.py)

  • Remove the if labelled gate on feature.prepare(). The RT regressor is self-supervised (uses high-confidence de novo predictions, not database labels), so it must run at both training and inference time.
  • This also removes the compute-features entrypoint restriction that blocked RetentionTimeFeature on unlabelled data.

Regressor checkpoint workflow (main.py, configs)

By default, the RT-to-iRT regressor is re-fitted from the current data at both training and inference time. This is the right behaviour for the general pretrained-model workflow, where inference data comes from unseen experiments. However, one important within-experiment use case requires carrying regressors forward: a calibrator trained on the subset of spectra that received database search labels and then applied to the remaining unlabelled de novo predictions from the same experiment(s).

The unlabelled portion typically contains a greater proportion of lower-quality spectra (the ones the database search couldn't confidently match), so its de novo prediction confidence distribution is likely skewed towards lower-confidence predictions. Fitting the RT-to-iRT regressor from the top predictions of this skewed distribution produces a noisier fit than one derived from the higher-confidence labelled portion. By saving the regressors fitted during training and loading them at inference time, the calibrator uses the cleaner mapping from the labelled data for experiments it has already seen, while still fitting fresh regressors for any new experiments encountered at inference.

  • save_regressors() / load_regressors() methods on RetentionTimeFeature for persisting per-experiment regressors to a pickle file, separate from the calibrator model.
  • __getstate__ / __setstate__ exclude transient regressor state from the calibrator pickle — regressors are always re-fitted from data unless explicitly loaded.
  • train.yaml: new irt_regressor_output_path option to save regressors after training.
  • predict.yaml: new irt_regressor_path option to load pre-fitted regressors at inference time.
  • main.py: train_entry_point saves regressors when configured; predict_entry_point loads them when configured.
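A hedged sketch of the checkpoint mechanics. The method names save_regressors/load_regressors and the pickle-exclusion idea follow the bullets above; representing the checkpoint as a single pickle of the regressor dict, and the FeatureSketch class itself, are illustrative assumptions:

```python
import pickle

from sklearn.linear_model import LinearRegression


def save_regressors(irt_predictors: dict[str, LinearRegression], path: str) -> None:
    """Persist per-experiment regressors separately from the calibrator model."""
    with open(path, "wb") as f:
        pickle.dump(irt_predictors, f)


def load_regressors(path: str) -> dict[str, LinearRegression]:
    with open(path, "rb") as f:
        return pickle.load(f)


class FeatureSketch:
    """Illustrates excluding transient regressor state from the calibrator pickle."""

    def __init__(self) -> None:
        self.irt_predictors: dict[str, LinearRegression] = {}

    def __getstate__(self) -> dict:
        state = self.__dict__.copy()
        # Regressors are always re-fitted from data unless explicitly loaded,
        # so they are dropped from the calibrator pickle.
        state["irt_predictors"] = {}
        return state
```

Keeping the regressor file separate means a pretrained calibrator stays experiment-agnostic by default, and the within-experiment workflow opts in via the two new config options.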

experiment_name always available (data_loaders.py)

  • InstaNovoDatasetLoader and MZTabDatasetLoader now derive experiment_name from the file stem when the column is not already present, even when add_index_cols=False. This ensures the per-experiment grouping works without requiring explicit user configuration.
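The fallback is essentially "use the file stem as the experiment name". A minimal stdlib sketch, assuming a row-dict representation (the real loaders' internals and column handling are more involved):

```python
from pathlib import Path


def ensure_experiment_name(rows: list[dict], source_file: str) -> list[dict]:
    """Fill in experiment_name from the input file stem when the column is missing."""
    stem = Path(source_file).stem  # e.g. "/data/run_01.mztab" -> "run_01"
    for row in rows:
        row.setdefault("experiment_name", stem)
    return rows
```

With this in place, spectra loaded from different files automatically land in different regressor groups, with no user configuration required.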

Configuration simplification (calibrator.yaml)

  • Remove the MLP hyperparameters (hidden_dim, learning_rate_init, alpha, max_iter, early_stopping, validation_fraction) along with the MLP-specific description of seed.
  • Add min_train_points (default 10).
  • train_fraction and seed retained with updated descriptions.
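For orientation, the simplified block might look like the following. This is a hedged sketch: min_train_points = 10 is the stated default, while the train_fraction and seed values shown here are illustrative, not the shipped defaults.

```yaml
# calibrator.yaml — RetentionTimeFeature section after this PR (sketch)
retention_time:
  train_fraction: 0.1    # top-confidence fraction used to fit each experiment's regressor
  min_train_points: 10   # fail fast if an experiment has fewer valid spectra
  seed: 42               # controls training-data selection (no longer an MLP optimiser seed)
```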

Tests

  • Rewrite RetentionTimeFeature tests to use pre-fitted LinearRegression instead of mocking MLPRegressor.predict.
  • Add test_pickle_excludes_regressor_state and test_save_and_load_regressors.
  • Update test_data_loaders.py to expect experiment_name always present.
  • Update learn_from_missing filtering tests for the new constructor signature.

Documentation

  • Rewrite RetentionTimeFeature section in docs/api/calibration.md covering per-experiment fitting, the experiment_name column, configuration parameters, and the regressor checkpoint workflow.
  • Update docs/configuration.md with the simplified config block and remove the labelled restriction note.

@JemmaLDaniel JemmaLDaniel self-assigned this Apr 10, 2026
@JemmaLDaniel JemmaLDaniel added the bug Something isn't working label Apr 10, 2026
@github-actions

Coverage

Coverage Report
| File | Stmts | Miss | Cover | Missing |
| --- | ---: | ---: | ---: | --- |
| __init__.py | 0 | 0 | 100% | |
| data_types.py | 4 | 0 | 100% | |
| calibration/__init__.py | 0 | 0 | 100% | |
| calibration/calibration_features.py | 365 | 14 | 96% | 247–248, 443, 728, 916, 920, 1232, 1248, 1256, 1357–1360, 1363 |
| calibration/calibrator.py | 90 | 15 | 83% | 69–70, 72, 106–109, 134–135, 137, 162–163, 167, 194–195 |
| compat/__init__.py | 0 | 0 | 100% | |
| compat/instanovo.py | 10 | 6 | 40% | 12, 14–15, 17, 24–25 |
| datasets/__init__.py | 0 | 0 | 100% | |
| datasets/calibration_dataset.py | 109 | 17 | 84% | 155, 169, 171, 173, 183, 196, 249, 251–252, 258–261, 263–266 |
| datasets/data_loaders.py | 274 | 14 | 94% | 23, 189, 220–221, 418, 459, 856, 860, 909, 920, 1032–1033, 1061–1062 |
| datasets/interfaces.py | 3 | 0 | 100% | |
| datasets/psm_dataset.py | 25 | 0 | 100% | |
| fdr/__init__.py | 0 | 0 | 100% | |
| fdr/base.py | 58 | 15 | 74% | 81, 85–86, 91, 98–99, 105, 126, 129–130, 135, 137–138, 144, 186 |
| fdr/database_grounded.py | 28 | 1 | 96% | 52 |
| fdr/nonparametric.py | 25 | 4 | 84% | 62, 68–69, 72 |
| scripts/__init__.py | 0 | 0 | 100% | |
| scripts/main.py | 197 | 197 | 0% | 8, 10–13, 16–20, 23–24, 26–28, 32, 39, 44, 47, 53, 55–56, 59, 68, 76, 79, 86, 88–90, 92, 94–99, 102, 104–105, 110, 125, 128, 135–141, 144–145, 148, 161–163, 166, 169, 174, 176–178, 180, 182–183, 186–187, 190, 192–193, 195, 197, 199–200, 202, 205–206, 209–210, 213–214, 217–219, 221–224, 227–229, 231, 234, 248–250, 252, 254, 259, 261–263, 265–266, 268, 270–271, 273–275, 277, 279, 281–282, 286–289, 291–292, 294–295, 297–298, 300, 303, 317–319, 322, 325, 330, 332–334, 336–338, 340–341, 344–345, 348, 350–351, 353, 355, 357–358, 360, 363–364, 370–372, 374–377, 380–381, 384–385, 388–389, 392–393, 401–403, 407, 410, 414, 417, 440, 453–454, 457, 479, 491–492, 495, 520, 533–534, 537, 552, 564–565, 568, 580, 592–593, 596, 611, 623–624 |
| utils/__init__.py | 4 | 0 | 100% | |
| utils/config_formatter.py | 53 | 40 | 24% | 29, 37–38, 40–42, 44, 55, 58–60, 62–63, 66–69, 72–74, 77–78, 80, 91, 102, 113, 127–128, 130–132, 145–147, 150, 153–154, 157–158, 160 |
| utils/config_path.py | 76 | 5 | 93% | 24–26, 117–118 |
| utils/peptide.py | 16 | 0 | 100% | |
| TOTAL | 1337 | 328 | 75% | |

| Tests | Skipped | Failures | Errors | Time |
| ---: | ---: | ---: | ---: | ---: |
| 298 | 0 💤 | 0 ❌ | 0 🔥 | 35.525s ⏱️ |
