Update Prosit cysteine handling and standardise residue terminology#158

Merged
JemmaLDaniel merged 13 commits into main from 157-support-prosit-ptm-models-for-prosit-features
Feb 16, 2026

@JemmaLDaniel
Collaborator

Description

Prosit model providers have updated their handling of cysteine carbamidomethylation. Previously, Prosit models treated plain C as carbamidomethylated. The updated models now require explicit carbamidomethylation annotation (C[UNIMOD:4]) to be passed to the model.

Additionally, this PR standardises terminology across the codebase by renaming invalid_prosit_tokens to invalid_prosit_residues to align with the existing residue_masses configuration.

Changes

1. Prosit Cysteine Handling

  • Removed the map_modification() function that converted C[UNIMOD:4] → C
  • Prosit features now pass carbamidomethylation explicitly to models
  • Peptides with unmodified cysteine (C) continue to be filtered out as invalid
  • Updated all Prosit-dependent features: PrositFeatures, ChimericFeatures, RetentionTimeFeature
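
To illustrate the new behaviour, here is a minimal sketch. It is not the winnow implementation: `prepare_for_prosit` is a hypothetical helper, and the input is assumed to be a tokenised residue list in the same form as the `prediction` lists described in this PR.

```python
from typing import List, Optional


def prepare_for_prosit(residues: List[str]) -> Optional[str]:
    """Return the sequence string to send to Prosit, or None if the peptide is invalid.

    Hypothetical helper: C[UNIMOD:4] is passed through unchanged, while a
    peptide containing plain (unmodified) C is rejected.
    """
    if "C" in residues:  # exact token match, so C[UNIMOD:4] is not affected
        return None
    return "".join(residues)


carbamidomethylated = ["P", "E", "C[UNIMOD:4]", "K"]
unmodified = ["P", "E", "C", "K"]

print(prepare_for_prosit(carbamidomethylated))
print(prepare_for_prosit(unmodified))
```

The point of the sketch is that no C[UNIMOD:4] → C rewrite happens any more: the annotation reaches the model as-is.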

2. Residue-Based Filtering

  • Changed filtering from untokenised strings (prediction_untokenised) to tokenised residue lists (prediction)
  • Invalid residues are now specified in tokenised form, matching the residue_masses format:
    • Residue modifications: N[UNIMOD:7], Q[UNIMOD:7], S[UNIMOD:21], etc.
    • N-terminal modifications: [UNIMOD:1], [UNIMOD:5], [UNIMOD:385]
  • More precise matching: filtering can now distinguish between different amino acids carrying the same modification, and can filter on exotic amino acid symbols such as U.
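
A rough sketch of why token-level matching is more precise than substring matching. The function name and the small invalid sets are illustrative assumptions, not the contents of winnow/configs/residues.yaml.

```python
from typing import Iterable, Set


def has_invalid_residue(prediction: Iterable[str], invalid: Set[str]) -> bool:
    """Return True if any residue token in the prediction is in the invalid set."""
    return any(residue in invalid for residue in prediction)


# Substring matching on the untokenised string cannot express "U" safely,
# because every UNIMOD annotation contains the letter U.
untokenised = "PEC[UNIMOD:4]K"
tokenised = ["P", "E", "C[UNIMOD:4]", "K"]
print("U" in untokenised)                     # substring check: false positive
print(has_invalid_residue(tokenised, {"U"}))  # token check: no match

# Token matching also separates the same modification on different amino acids.
only_deamidated_n = {"N[UNIMOD:7]"}
print(has_invalid_residue(["Q[UNIMOD:7]", "K"], only_deamidated_n))
print(has_invalid_residue(["N[UNIMOD:7]", "K"], only_deamidated_n))
```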

3. Terminology Standardisation

  • Renamed: invalid_prosit_tokens → invalid_prosit_residues
  • Updated across:
    • Code: PrositFeatures, ChimericFeatures, RetentionTimeFeature class parameters
    • Configuration: winnow/configs/residues.yaml
    • Documentation: API docs and configuration guide
    • Tests: All test fixtures and assertions

4. New Test Coverage

  • Added comprehensive data loader tests (tests/datasets/test_data_loaders.py):
    • Unit tests for MZTab and InstaNovo token remapping
    • Integration tests verifying Casanovo/InstaNovo tokens map to UNIMOD equivalents
    • Tests confirming only UNIMOD forms needed in invalid_prosit_residues
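
For context, the idea behind the remapping tests can be sketched as below. This is an assumption-laden illustration, not the loaders' actual `_map_modifications` code: `to_unimod` and `NAME_TO_UNIMOD` are hypothetical, the table covers only a few bracketed modification names, and the N-terminal hyphen is kept so that the output stays ProForma-style.

```python
import re

# Hypothetical lookup from modification names to UNIMOD accessions.
NAME_TO_UNIMOD = {
    "Acetyl": "UNIMOD:1",
    "Carbamidomethyl": "UNIMOD:4",
    "Carbamyl": "UNIMOD:5",
    "Deamidated": "UNIMOD:7",
    "Oxidation": "UNIMOD:35",
    "Ammonia-loss": "UNIMOD:385",
}


def to_unimod(sequence: str) -> str:
    """Rewrite bracketed modification names as UNIMOD accessions.

    Names missing from the table (including ones already in UNIMOD form)
    are left unchanged.
    """

    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        return "[" + NAME_TO_UNIMOD.get(name, name) + "]"

    return re.sub(r"\[([^\]]+)\]", replace, sequence)


print(to_unimod("[Ammonia-loss]-PEPTIDE"))
print(to_unimod("PEM[Oxidation]K"))
print(to_unimod("PEC[UNIMOD:4]K"))
```

Because every loader normalises to UNIMOD before filtering, invalid_prosit_residues only needs to list the UNIMOD forms.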

5. Documentation Updates

  • Updated configuration guide with cysteine handling notes
  • Added clarification that invalid_prosit_residues uses UNIMOD format
  • Updated API documentation with examples using new terminology
  • Expanded invalid residues list to include all supported-but-invalid modifications

Breaking Changes

The configuration parameter invalid_prosit_tokens has been renamed to invalid_prosit_residues. If you maintain a customised residues.yaml, please rename the parameter and list its entries in UNIMOD tokenised format only.
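
A minimal migration sketch for an already-loaded config mapping. `migrate_residues_config` is a hypothetical helper, and it assumes the parameter sits at the top level of the mapping. It only renames the key; entries still need to be checked by hand to confirm they are in UNIMOD tokenised form.

```python
from typing import Any, Dict

OLD_KEY = "invalid_prosit_tokens"
NEW_KEY = "invalid_prosit_residues"


def migrate_residues_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the config with the old parameter name replaced."""
    migrated = dict(config)  # shallow copy so the caller's mapping is untouched
    if OLD_KEY in migrated:
        if NEW_KEY in migrated:
            raise ValueError(f"Config defines both {OLD_KEY!r} and {NEW_KEY!r}")
        migrated[NEW_KEY] = migrated.pop(OLD_KEY)
    return migrated


old_config = {"invalid_prosit_tokens": ["N[UNIMOD:7]", "[UNIMOD:1]"]}
print(migrate_residues_config(old_config))
```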

@JemmaLDaniel JemmaLDaniel self-assigned this Feb 4, 2026
@JemmaLDaniel JemmaLDaniel added the bug Something isn't working label Feb 4, 2026
@JemmaLDaniel JemmaLDaniel linked an issue Feb 4, 2026 that may be closed by this pull request
@github-actions

github-actions Bot commented Feb 4, 2026

Coverage Report

File                                Stmts  Miss  Cover  Missing
__init__.py                             0     0   100%
data_types.py                           4     0   100%
calibration/__init__.py                 0     0   100%
calibration/calibration_features.py   264    11    95%  148–149, 312–314, 549–551, 1024–1026
calibration/calibrator.py              91    15    83%  69–70, 72, 106–109, 134–135, 137, 162–163, 167, 194–195
compat/__init__.py                      0     0   100%
compat/instanovo.py                    10     6    40%  12, 14–15, 17, 24–25
datasets/__init__.py                    0     0   100%
datasets/calibration_dataset.py        87    12    86%  138, 191, 193–194, 200–203, 205–208
datasets/data_loaders.py              203   153    24%  62–64, 67, 69, 74, 79, 91, 93–96, 98, 102–103, 105, 107, 122, 128, 137–138, 142, 160–161, 163–165, 167–169, 171–172, 174, 188, 191–192, 194–195, 200, 206–207, 216, 219, 229, 247–249, 258, 272, 277–279, 281, 285–286, 288–289, 294, 297, 304, 310, 324–325, 328, 331–332, 341, 350, 407–408, 410–413, 415, 419–420, 422, 434–436, 439–440, 458–459, 461–462, 465–467, 472, 475–476, 479–480, 486–487, 489, 501, 504, 516–517, 524, 530, 534, 536, 539, 543, 565, 590–591, 593–594, 597–599, 609–610, 640, 653–654, 658, 662, 676, 696, 706–707, 718, 795, 815–816, 818–820, 826–830, 835–836, 839, 842, 847, 853–859, 865–866
datasets/interfaces.py                  3     0   100%
datasets/psm_dataset.py                25     0   100%
fdr/__init__.py                         0     0   100%
fdr/base.py                            58    15    74%  81, 85–86, 91, 98–99, 105, 126, 129–130, 135, 137–138, 144, 186
fdr/database_grounded.py               25     0   100%
fdr/nonparametric.py                   25     4    84%  62, 68–69, 72
scripts/__init__.py                     0     0   100%
scripts/main.py                       136   136     0%  8, 10–14, 17–21, 24–25, 27–29, 33, 40, 45, 48, 54, 56–57, 60, 69, 77, 80, 87, 89–91, 93, 95–100, 103, 105–106, 111, 126, 129, 135–136, 138–140, 143–144, 147, 160–162, 165, 168, 173, 175–177, 179, 181–182, 185–186, 189, 191–192, 194, 196–197, 200–201, 204–205, 208–209, 212–213, 215, 218, 232–234, 237, 240, 245, 247–249, 251–252, 254–255, 258–259, 262, 264–265, 267, 269–270, 273–274, 280–281, 284–285, 288–289, 292–293, 301–302, 305–308, 312, 315, 338, 351–352, 355, 380, 393–394, 397, 412, 424–425, 428, 443, 455–456
utils/__init__.py                       4     0   100%
utils/config_formatter.py              53    40    24%  29, 37–38, 40–42, 44, 55, 58–60, 62–63, 66–69, 72–74, 77–78, 80, 91, 102, 113, 127–128, 130–132, 145–147, 150, 153–154, 157–158, 160
utils/config_path.py                   76     5    93%  24–26, 117–118
utils/peptide.py                       16     0   100%
TOTAL                                1080   397    63%

Tests: 174 | Skipped: 0 💤 | Failures: 0 ❌ | Errors: 0 🔥 | Time: 12.358s ⏱️

@JemmaLDaniel JemmaLDaniel requested a review from BioGeek February 4, 2026 09:35
Comment thread tests/datasets/test_data_loaders.py Outdated
)
assert (
mztab_loader._map_modifications("[Ammonia-loss]-PEPTIDE")
== "[UNIMOD:385]PEPTIDE"
Contributor

[UNIMOD:385]PEPTIDE is not ProForma compliant. The ProForma-compliant version is [UNIMOD:385]-PEPTIDE.
I would prefer to only support ProForma-compliant peptide strings. (Might need updates on the InstaNovo/Casanovo/... side though?)

>>> from pyteomics import proforma
>>> proforma.parse("[UNIMOD:385]PEPTIDE")
Traceback (most recent call last):
  File "<python-input-1>", line 1, in <module>
    proforma.parse("[UNIMOD:385]PEPTIDE")
    ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/j-vangoey/code/winnow/.venv/lib/python3.13/site-packages/pyteomics/proforma.py", line 2024, in parse
    raise ProFormaError(
        "Error In State {state}, unexpected {c} found at index {i}".format(**locals()), i, state)
pyteomics.proforma.ProFormaError: Pyteomics error, message: 'Error In State ParserStateEnum.post_tag_before, unexpected P found at index 13' ('Error In State ParserStateEnum.post_tag_before, unexpected P found at index 13', 13, <ParserStateEnum.post_tag_before: 10>)
>>> proforma.parse("[UNIMOD:385]-PEPTIDE")
([('P', None), ('E', None), ('P', None), ('T', None), ('I', None), ('D', None), ('E', None)], {'n_term': [UnimodModification('385', None, None)], 'c_term': None, 'unlocalized_modifications': [], 'labile_modifications': [], 'fixed_modifications': [], 'intervals': [], 'isotopes': [], 'group_ids': [], 'charge_state': None})

Collaborator Author

Oh my mistake, will fix

Contributor

@BioGeek BioGeek Feb 10, 2026

If it is a large amount of work, then a fix can be in an additional PR and we can get this merged already.

Comment thread uv.lock
Contributor

Same comment as here.

If you update uv.lock, you should also regenerate requirements.txt (uv pip compile pyproject.toml -o requirements.txt).

To not have to do this manually each time, you can add a pre-commit hook to do this automatically:

repos:
  - repo: https://github.com/astral-sh/uv-pre-commit
    # uv version.
    rev: 0.9.25
    hooks:
      - id: uv-export

Collaborator Author

That's nice, thanks, will do

@JemmaLDaniel JemmaLDaniel requested a review from BioGeek February 11, 2026 15:34
Contributor

@BioGeek BioGeek left a comment

Looks good to me!

@JemmaLDaniel JemmaLDaniel merged commit 614739a into main Feb 16, 2026
4 checks passed
@JemmaLDaniel JemmaLDaniel deleted the 157-support-prosit-ptm-models-for-prosit-features branch February 16, 2026 14:27
Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support Prosit PTM models for Prosit features

2 participants