-
Notifications
You must be signed in to change notification settings - Fork 296
Description
Hi, first of all thanks for developing pysam, it's fantastic package! I'm using the latest version (0.23.3).
Lately, I encountered several problems while working with modifications written by remora infer.
Everything works fine if there is only 1 modification in the BAM, for example
MM:Z:T+76?,0,0,0; ML:B:C,4,5,4
But when I store several modifications (codes in range 76-94), pysam fails with ValueError: chr() arg not in range(0x110000).
Interestingly, both samtools and IGV are handling the modifications from the same BAM file correctly.
I tried to narrow it down and noticed when storing the same modifications as single letter codes m, h, c, f, g, e, b, a, o, n,
those are written in BAM as
MM:Z:T+m?,0;T+h?,0;T+c?,0;T+f?,0;T+g?,0;T+e?,0;T+b?,0;T+a?,0;T+o?,0;T+n?,0;
while pysam decode them incorrectly as
('T', 0, 'm'), ('T', 0, 'h'), ('T', 0, 'c'), ('T', 0, 'f'), ('T', 0, 'g'), ('\x00', -1098023120, '\x01'), ('嶨', -1137636000, 483813872), ('碴', 0, 1120917344), ('\x00', -1154113870, 0), ('碴', 0, 1099221952)
It seems, the issue is after 5th modification. I prepared two BAM files containing 6 modifications
MM:Z:T+m?,0;T+h?,0;T+f?,0;T+c?,0;T+g?,0;T+e?,0; ML:B:C,4,1,0,1,2,2
MM:Z:T+m?,0;T+h?,0;T+f?,0;T+c?,0;T+g?,0;T+e?,0; ML:B:C,3,0,0,0,0,0
and 5 modifications (g was skipped)
MM:Z:T+m?,0;T+h?,0;T+f?,0;T+c?,0;T+e?,0; ML:B:C,4,1,0,1,2
MM:Z:T+m?,0;T+h?,0;T+f?,0;T+c?,0;T+e?,0; ML:B:C,3,0,0,0,0
and pysam failed with 6 modifications
# print(a.get_tag('MM'), a.get_tag('ML'), a.modified_bases.keys())
T+m?,0;T+h?,0;T+f?,0;T+c?,0;T+g?,0;T+e?,0; array('B', [4, 1, 0, 1, 2, 2]) dict_keys([('T', 0, 'm'), ('T', 0, 'h'), ('T', 0, 'f'), ('T', 0, 'c'), ('T', 0, 'g'), ('\x00', -144053840, '\x1e')])
T+m?,0;T+h?,0;T+f?,0;T+c?,0;T+g?,0;T+e?,0; array('B', [3, 0, 0, 0, 0, 0]) dict_keys([('T', 0, 'm'), ('T', 0, 'h'), ('T', 0, 'f'), ('T', 0, 'c'), ('T', 0, 'g'), ('瑄', -144053840, 187320624)])but processed correctly 5 modifications
# print(a.get_tag('MM'), a.get_tag('ML'), a.modified_bases.keys())
T+m?,0;T+h?,0;T+f?,0;T+c?,0;T+e?,0; array('B', [4, 1, 0, 1, 2]) dict_keys([('T', 0, 'm'), ('T', 0, 'h'), ('T', 0, 'f'), ('T', 0, 'c'), ('T', 0, 'e')])
T+m?,0;T+h?,0;T+f?,0;T+c?,0;T+e?,0; array('B', [3, 0, 0, 0, 0]) dict_keys([('T', 0, 'm'), ('T', 0, 'h'), ('T', 0, 'f'), ('T', 0, 'c'), ('T', 0, 'e')])And because g was skipped in the 5 modification file, I know the issue is not e itself, just the number of mods per entry.
Do you know why this happens?
How could more than 5 modifications per base be processed with pysam?
Of note, we work with RNA modifications, there are over 170 of biologically concurring RNA modifications, so this will be of interest to wider audience soon.
Already at this time there are models to call 8 modifications in RNA.