Update Prosit cysteine handling and standardise residue terminology #158
…ts - explicit carboxyamidomethylation notation docs: remove unnecessary capitalisation in docs
…id prosit residues
)
assert (
    mztab_loader._map_modifications("[Ammonia-loss]-PEPTIDE")
    == "[UNIMOD:385]PEPTIDE"
[UNIMOD:385]PEPTIDE is not ProForma compliant. The ProForma-compliant version is [UNIMOD:385]-PEPTIDE.
I would prefer to only support ProForma-compliant peptide strings. (Might need updates on the InstaNovo/Casanovo/... side though?)
>>> from pyteomics import proforma
>>> proforma.parse("[UNIMOD:385]PEPTIDE")
Traceback (most recent call last):
  File "<python-input-1>", line 1, in <module>
    proforma.parse("[UNIMOD:385]PEPTIDE")
    ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/j-vangoey/code/winnow/.venv/lib/python3.13/site-packages/pyteomics/proforma.py", line 2024, in parse
    raise ProFormaError(
        "Error In State {state}, unexpected {c} found at index {i}".format(**locals()), i, state)
pyteomics.proforma.ProFormaError: Pyteomics error, message: 'Error In State ParserStateEnum.post_tag_before, unexpected P found at index 13' ('Error In State ParserStateEnum.post_tag_before, unexpected P found at index 13', 13, <ParserStateEnum.post_tag_before: 10>)
>>> proforma.parse("[UNIMOD:385]-PEPTIDE")
([('P', None), ('E', None), ('P', None), ('T', None), ('I', None), ('D', None), ('E', None)], {'n_term': [UnimodModification('385', None, None)], 'c_term': None, 'unlocalized_modifications': [], 'labile_modifications': [], 'fixed_modifications': [], 'intervals': [], 'isotopes': [], 'group_ids': [], 'charge_state': None})
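For illustration only, a minimal sketch of a mapping that keeps the ProForma hyphen after an N-terminal modification. The function name `map_n_term_modification` and the lookup table `NAME_TO_UNIMOD` are hypothetical and are not taken from the winnow codebase; only the Ammonia-loss to UNIMOD:385 pairing comes from the test under review.

```python
import re

# Hypothetical lookup table; only this one entry is taken from the test above.
NAME_TO_UNIMOD = {"Ammonia-loss": "UNIMOD:385"}

# An N-terminal modification is a bracketed tag at the start of the string,
# immediately followed by the hyphen that ProForma requires.
N_TERM_PATTERN = re.compile(r"^\[([^\]]+)\]-")


def map_n_term_modification(peptide: str) -> str:
    """Swap a named N-terminal modification for its UNIMOD accession, keeping the hyphen."""
    match = N_TERM_PATTERN.match(peptide)
    if match is None:
        # No N-terminal tag: return the peptide unchanged.
        return peptide
    name = match.group(1)
    # Unknown names are passed through unchanged rather than guessed.
    accession = NAME_TO_UNIMOD.get(name, name)
    return f"[{accession}]-{peptide[match.end():]}"
```

With this shape, the assertion in the test would compare against "[UNIMOD:385]-PEPTIDE", which is the form that `proforma.parse` accepts in the session above.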
Oh my mistake, will fix
If it is a large amount of work, the fix can go in a follow-up PR and we can get this one merged already.
Same comment as here.
If you update uv.lock, you should also regenerate requirements.txt (uv pip compile pyproject.toml -o requirements.txt).
To avoid doing this manually each time, you can add a pre-commit hook that does it automatically:
repos:
  - repo: https://github.com/astral-sh/uv-pre-commit
    # uv version.
    rev: 0.9.25
    hooks:
      - id: uv-export
That's nice, thanks, will do
…es are always synced to pyproject.toml
Description
Prosit model providers have updated their handling of cysteine carbamidomethylation. Previously, Prosit models treated plain C as carbamidomethylated. The updated models now require explicit carbamidomethylation annotation (C[UNIMOD:4]) to be passed to the model.
Additionally, this PR standardises terminology across the codebase by renaming invalid_prosit_tokens to invalid_prosit_residues to align with the existing residue_masses configuration.
Changes
1. Prosit Cysteine Handling
map_modification() function that converted C[UNIMOD:4] → C
2. Residue-Based Filtering
N[UNIMOD:7], Q[UNIMOD:7], S[UNIMOD:21], etc.
[UNIMOD:1], [UNIMOD:5], [UNIMOD:385]
U.
3. Terminology Standardization
invalid_prosit_tokens → invalid_prosit_residues
PrositFeatures, ChimericFeatures, RetentionTimeFeature class parameters
winnow/configs/residues.yaml
4. New Test Coverage
tests/datasets/test_data_loaders.py):
invalid_prosit_residues
5. Documentation Updates
invalid_prosit_residues uses UNIMOD format
Breaking Changes
The configuration parameter invalid_prosit_tokens has been renamed to invalid_prosit_residues. Please update customised residues.yaml to the new parameter name, using only UNIMOD tokenised format.
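As a rough illustration of residue-based filtering, the sketch below splits a UNIMOD-formatted peptide into residue tokens and checks them against a list of invalid residues. The names `tokenise` and `has_invalid_prosit_residue` are hypothetical, and the example list only repeats the residues mentioned in this description; the real values live in winnow/configs/residues.yaml and the actual implementation may differ.

```python
import re

# Example values copied from the description above; an assumption, not the real config.
INVALID_PROSIT_RESIDUES = [
    "N[UNIMOD:7]", "Q[UNIMOD:7]", "S[UNIMOD:21]",
    "[UNIMOD:1]", "[UNIMOD:5]", "[UNIMOD:385]",
    "U",
]

# A token is either a residue letter with an optional bracketed tag,
# or a standalone bracketed tag (a terminal modification).
TOKEN_PATTERN = re.compile(r"[A-Z](?:\[[^\]]+\])?|\[[^\]]+\]")


def tokenise(peptide: str) -> list[str]:
    """Split a peptide string into residue tokens; hyphens are skipped."""
    return TOKEN_PATTERN.findall(peptide)


def has_invalid_prosit_residue(peptide: str, invalid_residues: list[str]) -> bool:
    """Return True if any token of the peptide appears in the invalid list."""
    invalid = set(invalid_residues)
    return any(token in invalid for token in tokenise(peptide))
```

Matching whole tokens rather than substrings means that C[UNIMOD:4] passes through untouched, which fits the updated models' requirement for explicit carbamidomethylation, while a peptide such as [UNIMOD:385]-PEPTIDE would be flagged.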