Fix per-experiment iRT regression #188
Open

JemmaLDaniel wants to merge 4 commits into `main`
# Per-experiment RT-to-iRT linear regression
## Motivation
The existing `RetentionTimeFeature` uses a single `MLPRegressor` to map observed retention times (RT) to indexed retention times (iRT) across the entire dataset. This is problematic for two reasons:

1. **The RT-to-iRT mapping is experiment-specific.** Different LC-MS experiments have different chromatographic conditions (column, gradient, temperature, etc.), so a single global regressor conflates distinct linear relationships. When training or predicting on multi-experiment data, the MLP fits an average mapping that is suboptimal for every individual experiment.
2. **The relationship is linear.** RT and iRT are related by a simple affine transform within a single experiment. An MLP is unnecessarily complex for this: it introduces seven hyperparameters (`hidden_dim`, `learning_rate_init`, `alpha`, `max_iter`, `early_stopping`, `validation_fraction`, `seed`), risks overfitting on small experiments, and adds nondeterminism from stochastic optimisation. A `LinearRegression` is the natural fit.

A secondary issue is that the regressor was only fitted during training. While the trained MLP was pickled with the calibrator model and therefore available at inference time, it carried the training data's RT-to-iRT mapping, which will frequently be wrong for inference data from entirely different experiments with different chromatographic conditions. The regressor should always be re-fitted from the current data. Because of this, the `compute-features` entrypoint explicitly blocked unlabelled data with `RetentionTimeFeature` via an `if labelled` gate.

## Changes
### Per-experiment linear regressors (`calibration_features.py`)

- Replaced the single `MLPRegressor` with a `Dict[str, LinearRegression]` (`irt_predictors`), keyed by experiment name.
- `prepare()` groups spectra by `experiment_name`, selects the top `train_fraction` by confidence per experiment, batches all Koina iRT calls into a single request, then fits one `LinearRegression` per experiment.
- When `experiment_name` is absent, falls back to a single `__global__` regressor with a warning.
- `compute()` applies the correct per-experiment regressor when predicting iRT from observed RT.
- New `_select_training_data()` helper with a `min_train_points` guard that raises early with a clear error message if an experiment has too few valid spectra.
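The per-experiment fitting scheme can be sketched as below. This is a minimal illustration, not the PR's implementation: the column names (`rt`, `irt`, `confidence`), the function name, and the default values are assumptions; only `irt_predictors`, `experiment_name`, `train_fraction`, `min_train_points`, and the `__global__` fallback key come from the PR, and the real code batches Koina iRT calls rather than reading an `irt` column.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

GLOBAL_KEY = "__global__"  # fallback key when experiment_name is missing (as in the PR)

def fit_irt_predictors(
    df: pd.DataFrame,
    train_fraction: float = 0.5,
    min_train_points: int = 10,
) -> dict[str, LinearRegression]:
    """Fit one RT-to-iRT LinearRegression per experiment (illustrative sketch)."""
    if "experiment_name" not in df.columns:
        # single global regressor when no experiment grouping is available
        df = df.assign(experiment_name=GLOBAL_KEY)
    predictors: dict[str, LinearRegression] = {}
    for name, group in df.groupby("experiment_name"):
        # keep the top train_fraction most confident spectra of this experiment
        top = group.nlargest(max(1, int(len(group) * train_fraction)), "confidence")
        if len(top) < min_train_points:
            raise ValueError(f"experiment {name!r}: only {len(top)} valid training spectra")
        reg = LinearRegression()
        reg.fit(top[["rt"]].to_numpy(), top["irt"].to_numpy())
        predictors[name] = reg
    return predictors
```

Because each experiment's RT-to-iRT relation is affine, a two-parameter fit per group recovers it exactly where a single global model could not.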
### Always call `prepare()` (`calibrator.py`)

- Removed the `if labelled` gate on `feature.prepare()`. The RT regressor is self-supervised (it uses high-confidence de novo predictions, not database labels), so it must run at both training and inference time.
- Removed the `compute-features` entrypoint restriction that blocked `RetentionTimeFeature` on unlabelled data.
### Regressor checkpoint workflow (`main.py`, configs)

By default, the RT-to-iRT regressor is re-fitted from the current data at both training and inference time. This is the right behaviour for the general pretrained-model workflow, where inference data comes from unseen experiments. However, one important within-experiment use case requires carrying regressors forward: when a calibrator is trained on the subset of spectra that received database search labels and then applied to the remaining unlabelled de novo predictions from the same experiment(s). The unlabelled portion typically contains a greater proportion of lower-quality spectra (the ones the database search couldn't confidently match), so its de novo prediction confidence distribution is likely skewed towards lower-confidence predictions. Fitting the RT-to-iRT regressor from the top predictions in this skewed distribution produces a noisier fit than one derived from the higher-confidence labelled portion. By saving the regressors fitted during training and loading them at inference time, the calibrator uses the cleaner mapping from the labelled data for experiments it has already seen, while still fitting fresh regressors for any new experiments encountered at inference.
- New `save_regressors()` / `load_regressors()` methods on `RetentionTimeFeature` for persisting per-experiment regressors to a pickle file, separate from the calibrator model.
- `__getstate__` / `__setstate__` exclude transient regressor state from the calibrator pickle; regressors are always re-fitted from data unless explicitly loaded.
- `train.yaml`: new `irt_regressor_output_path` option to save regressors after training.
- `predict.yaml`: new `irt_regressor_path` option to load pre-fitted regressors at inference time.
- `main.py`: `train_entry_point` saves regressors when configured; `predict_entry_point` loads them when configured.

### `experiment_name` always available (`data_loaders.py`)

- `InstaNovoDatasetLoader` and `MZTabDatasetLoader` now derive `experiment_name` from the file stem when the column is not already present, even when `add_index_cols=False`. This ensures per-experiment grouping works without requiring explicit user configuration.
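The persistence behaviour described above (regressors excluded from the ordinary calibrator pickle, but saveable and loadable through explicit methods) can be sketched as follows. The class body here is a stand-in, not the actual `RetentionTimeFeature`; only the attribute name `irt_predictors` and the method names `save_regressors` / `load_regressors` come from the PR.

```python
from __future__ import annotations

import pickle
from pathlib import Path

from sklearn.linear_model import LinearRegression

class RetentionTimeFeatureSketch:
    """Illustrative sketch of the regressor-persistence behaviour in this PR."""

    def __init__(self) -> None:
        # per-experiment regressors, keyed by experiment name
        self.irt_predictors: dict[str, LinearRegression] = {}

    # --- pickling: drop transient regressor state from the calibrator pickle ---
    def __getstate__(self) -> dict:
        state = self.__dict__.copy()
        state["irt_predictors"] = {}  # always re-fit from data unless explicitly loaded
        return state

    def __setstate__(self, state: dict) -> None:
        self.__dict__.update(state)

    # --- explicit checkpointing, separate from the calibrator model ---
    def save_regressors(self, path: str | Path) -> None:
        with open(path, "wb") as f:
            pickle.dump(self.irt_predictors, f)

    def load_regressors(self, path: str | Path) -> None:
        with open(path, "rb") as f:
            self.irt_predictors = pickle.load(f)
```

With this split, `pickle.dumps(calibrator)` never carries a stale RT-to-iRT mapping, while a training run can still export its cleaner, label-derived regressors for reuse on the same experiments.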
### Configuration simplification (`calibrator.yaml`)

- Removed `hidden_dim`, `learning_rate_init`, `alpha`, `max_iter`, `early_stopping`, `validation_fraction` (and the MLP-specific `seed` description).
- Added `min_train_points` (default 10).
- `train_fraction` and `seed` retained with updated descriptions.
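After the simplification, the relevant `calibrator.yaml` block might look like this. The surrounding key name and the `train_fraction`/`seed` values are illustrative assumptions; only the three option names and the `min_train_points` default of 10 come from the PR.

```yaml
# hypothetical layout of the simplified RetentionTimeFeature config
retention_time_feature:
  train_fraction: 0.5   # top fraction of spectra, by confidence, used to fit each regressor (illustrative value)
  min_train_points: 10  # minimum valid spectra required per experiment (default from the PR)
  seed: 42              # illustrative value
```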
## Tests

- Updated the `RetentionTimeFeature` tests to use a pre-fitted `LinearRegression` instead of mocking `MLPRegressor.predict`.
- New tests: `test_pickle_excludes_regressor_state` and `test_save_and_load_regressors`.
- Updated `test_data_loaders.py` to expect `experiment_name` to always be present.
- Updated the `learn_from_missing` filtering tests for the new constructor signature.
## Documentation

- New `RetentionTimeFeature` section in `docs/api/calibration.md` covering per-experiment fitting, the `experiment_name` column, configuration parameters, and the regressor checkpoint workflow.
- Updated `docs/configuration.md` with the simplified config block and removed the `labelled` restriction note.