issues decoding >5 modifications from MM/ML tags

Hi, first of all thanks for developing pysam, it's fantastic package! I'm using the latest version (`0.23.3`). 

Lately, I encountered several problems while working with modifications written by `remora infer`. 

Everything works fine if there is only 1 modification in the BAM, for example 
```
MM:Z:T+76?,0,0,0;       ML:B:C,4,5,4
```

But when I store several modifications (codes in range 76-94), pysam fails with `ValueError: chr() arg not in range(0x110000)`. 
Interestingly, both `samtools` and IGV are handling the modifications from the same BAM file correctly. 

I tried to narrow it down and noticed when storing the same modifications as single letter codes `m, h, c, f, g, e, b, a, o, n`, 
those are written in BAM as 
```
MM:Z:T+m?,0;T+h?,0;T+c?,0;T+f?,0;T+g?,0;T+e?,0;T+b?,0;T+a?,0;T+o?,0;T+n?,0;
```
while pysam decode them incorrectly as  
```
('T', 0, 'm'), ('T', 0, 'h'), ('T', 0, 'c'), ('T', 0, 'f'), ('T', 0, 'g'), ('\x00', -1098023120, '\x01'), ('嶨', -1137636000, 483813872), ('碴', 0, 1120917344), ('\x00', -1154113870, 0), ('碴', 0, 1099221952)
```

It seems, the issue is after 5th modification. I prepared two BAM files containing 6 modifications
```
MM:Z:T+m?,0;T+h?,0;T+f?,0;T+c?,0;T+g?,0;T+e?,0; ML:B:C,4,1,0,1,2,2
MM:Z:T+m?,0;T+h?,0;T+f?,0;T+c?,0;T+g?,0;T+e?,0; ML:B:C,3,0,0,0,0,0
```
and 5 modifications (`g` was skipped)
```
MM:Z:T+m?,0;T+h?,0;T+f?,0;T+c?,0;T+e?,0;        ML:B:C,4,1,0,1,2
MM:Z:T+m?,0;T+h?,0;T+f?,0;T+c?,0;T+e?,0;        ML:B:C,3,0,0,0,0
```

and pysam failed with 6 modifications 
```python
# print(a.get_tag('MM'), a.get_tag('ML'), a.modified_bases.keys())
T+m?,0;T+h?,0;T+f?,0;T+c?,0;T+g?,0;T+e?,0; array('B', [4, 1, 0, 1, 2, 2]) dict_keys([('T', 0, 'm'), ('T', 0, 'h'), ('T', 0, 'f'), ('T', 0, 'c'), ('T', 0, 'g'), ('\x00', -144053840, '\x1e')])
T+m?,0;T+h?,0;T+f?,0;T+c?,0;T+g?,0;T+e?,0; array('B', [3, 0, 0, 0, 0, 0]) dict_keys([('T', 0, 'm'), ('T', 0, 'h'), ('T', 0, 'f'), ('T', 0, 'c'), ('T', 0, 'g'), ('瑄', -144053840, 187320624)])
```
but processed correctly 5 modifications
```python
# print(a.get_tag('MM'), a.get_tag('ML'), a.modified_bases.keys())
T+m?,0;T+h?,0;T+f?,0;T+c?,0;T+e?,0; array('B', [4, 1, 0, 1, 2]) dict_keys([('T', 0, 'm'), ('T', 0, 'h'), ('T', 0, 'f'), ('T', 0, 'c'), ('T', 0, 'e')])
T+m?,0;T+h?,0;T+f?,0;T+c?,0;T+e?,0; array('B', [3, 0, 0, 0, 0]) dict_keys([('T', 0, 'm'), ('T', 0, 'h'), ('T', 0, 'f'), ('T', 0, 'c'), ('T', 0, 'e')])
```
And because `g` was skipped in the 5 modification file, I know the issue is not `e` itself, just the number of mods per entry. 

Do you know why this happens?  
How could more than 5 modifications per base be processed with pysam? 

Of note, we work with RNA modifications, there are over 170 of biologically concurring RNA modifications, so this will be of interest to wider audience soon. 
Already at this time there are models to call [8 modifications in RNA](https://github.com/nanoporetech/dorado?tab=readme-ov-file#rna-models). 



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

issues decoding >5 modifications from MM/ML tags #1348

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

issues decoding >5 modifications from MM/ML tags #1348

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions