Skip to content

issues decoding >5 modifications from MM/ML tags #1348

@lpryszcz

Description

@lpryszcz

Hi, first of all thanks for developing pysam, it's fantastic package! I'm using the latest version (0.23.3).

Lately, I encountered several problems while working with modifications written by remora infer.

Everything works fine if there is only 1 modification in the BAM, for example

MM:Z:T+76?,0,0,0;       ML:B:C,4,5,4

But when I store several modifications (codes in range 76-94), pysam fails with ValueError: chr() arg not in range(0x110000).
Interestingly, both samtools and IGV are handling the modifications from the same BAM file correctly.

I tried to narrow it down and noticed when storing the same modifications as single letter codes m, h, c, f, g, e, b, a, o, n,
those are written in BAM as

MM:Z:T+m?,0;T+h?,0;T+c?,0;T+f?,0;T+g?,0;T+e?,0;T+b?,0;T+a?,0;T+o?,0;T+n?,0;

while pysam decode them incorrectly as

('T', 0, 'm'), ('T', 0, 'h'), ('T', 0, 'c'), ('T', 0, 'f'), ('T', 0, 'g'), ('\x00', -1098023120, '\x01'), ('嶨', -1137636000, 483813872), ('碴', 0, 1120917344), ('\x00', -1154113870, 0), ('碴', 0, 1099221952)

It seems, the issue is after 5th modification. I prepared two BAM files containing 6 modifications

MM:Z:T+m?,0;T+h?,0;T+f?,0;T+c?,0;T+g?,0;T+e?,0; ML:B:C,4,1,0,1,2,2
MM:Z:T+m?,0;T+h?,0;T+f?,0;T+c?,0;T+g?,0;T+e?,0; ML:B:C,3,0,0,0,0,0

and 5 modifications (g was skipped)

MM:Z:T+m?,0;T+h?,0;T+f?,0;T+c?,0;T+e?,0;        ML:B:C,4,1,0,1,2
MM:Z:T+m?,0;T+h?,0;T+f?,0;T+c?,0;T+e?,0;        ML:B:C,3,0,0,0,0

and pysam failed with 6 modifications

# print(a.get_tag('MM'), a.get_tag('ML'), a.modified_bases.keys())
T+m?,0;T+h?,0;T+f?,0;T+c?,0;T+g?,0;T+e?,0; array('B', [4, 1, 0, 1, 2, 2]) dict_keys([('T', 0, 'm'), ('T', 0, 'h'), ('T', 0, 'f'), ('T', 0, 'c'), ('T', 0, 'g'), ('\x00', -144053840, '\x1e')])
T+m?,0;T+h?,0;T+f?,0;T+c?,0;T+g?,0;T+e?,0; array('B', [3, 0, 0, 0, 0, 0]) dict_keys([('T', 0, 'm'), ('T', 0, 'h'), ('T', 0, 'f'), ('T', 0, 'c'), ('T', 0, 'g'), ('瑄', -144053840, 187320624)])

but processed correctly 5 modifications

# print(a.get_tag('MM'), a.get_tag('ML'), a.modified_bases.keys())
T+m?,0;T+h?,0;T+f?,0;T+c?,0;T+e?,0; array('B', [4, 1, 0, 1, 2]) dict_keys([('T', 0, 'm'), ('T', 0, 'h'), ('T', 0, 'f'), ('T', 0, 'c'), ('T', 0, 'e')])
T+m?,0;T+h?,0;T+f?,0;T+c?,0;T+e?,0; array('B', [3, 0, 0, 0, 0]) dict_keys([('T', 0, 'm'), ('T', 0, 'h'), ('T', 0, 'f'), ('T', 0, 'c'), ('T', 0, 'e')])

And because g was skipped in the 5 modification file, I know the issue is not e itself, just the number of mods per entry.

Do you know why this happens?
How could more than 5 modifications per base be processed with pysam?

Of note, we work with RNA modifications, there are over 170 of biologically concurring RNA modifications, so this will be of interest to wider audience soon.
Already at this time there are models to call 8 modifications in RNA.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions