Hi Dan, thanks again.
I’ve put a few tests together to try and make sure I get what’s happening here (apologies for the wall of text):
I have made a few changes after your explanation:
mt=mt.annotate_rows(reg=hl.parse_locus_interval("["+mt.locus.contig+":"+hl.str(mt.locus.position)+"-"+hl.str(mt.locus.position+hl.len(mt.alleles[0]))+")"))
So, I annotate a region (reg) as the length of the reference allele only now (I realized you don’t really need to consider anything else as that determines how much of the reference would overlap with the bed). I use ‘[’ to start so it is left inclusive and ‘)’ so it is right exclusive. Exclusive because position+length will be 1 greater so it should be the same as pos+len-1+’]’ right?
Anyway, the table is:
>>> mt.reg.show()
+---------------+----------------+-------------------------+
| locus | alleles | reg |
+---------------+----------------+-------------------------+
| locus<GRCh38> | array<str> | interval<locus<GRCh38>> |
+---------------+----------------+-------------------------+
| chr1:14397 | ["CTGT","C"] | [chr1:14397-chr1:14401) |
| chr1:14436 | ["G","A"] | [chr1:14436-chr1:14437) |
| chr1:14451 | ["C","T"] | [chr1:14451-chr1:14452) |
| chr1:14455 | ["T","TAAGAG"] | [chr1:14455-chr1:14456) |
| chr1:14464 | ["A","T"] | [chr1:14464-chr1:14465) |
| chr1:14469 | ["C","T"] | [chr1:14469-chr1:14470) |
| chr1:14471 | ["C","T"] | [chr1:14471-chr1:14472) |
| chr1:14487 | ["G","A"] | [chr1:14487-chr1:14488) |
| chr1:14498 | ["G","A"] | [chr1:14498-chr1:14499) |
+---------------+----------------+-------------------------+
So 1 base intervals e.g. chr1:14436 G/A are (properly?) represented as [chr1:14451-chr1:14452)
I’m a bit surprised by what reg.end.show() gives though:
>>> mt.reg.start.show()
+---------------+----------------+---------------+
| locus | alleles | <expr> |
+---------------+----------------+---------------+
| locus<GRCh38> | array<str> | locus<GRCh38> |
+---------------+----------------+---------------+
| chr1:14397 | ["CTGT","C"] | chr1:14397 |
| chr1:14436 | ["G","A"] | chr1:14436 |
| chr1:14451 | ["C","T"] | chr1:14451 |
| chr1:14455 | ["T","TAAGAG"] | chr1:14455 |
| chr1:14464 | ["A","T"] | chr1:14464 |
| chr1:14469 | ["C","T"] | chr1:14469 |
| chr1:14471 | ["C","T"] | chr1:14471 |
| chr1:14487 | ["G","A"] | chr1:14487 |
| chr1:14498 | ["G","A"] | chr1:14498 |
+---------------+----------------+---------------+
>>> mt.reg.end.show()
+---------------+----------------+---------------+
| locus | alleles | <expr> |
+---------------+----------------+---------------+
| locus<GRCh38> | array<str> | locus<GRCh38> |
+---------------+----------------+---------------+
| chr1:14397 | ["CTGT","C"] | chr1:14401 |
| chr1:14436 | ["G","A"] | chr1:14437 |
| chr1:14451 | ["C","T"] | chr1:14452 |
| chr1:14455 | ["T","TAAGAG"] | chr1:14456 |
| chr1:14464 | ["A","T"] | chr1:14465 |
| chr1:14469 | ["C","T"] | chr1:14470 |
| chr1:14471 | ["C","T"] | chr1:14472 |
| chr1:14487 | ["G","A"] | chr1:14488 |
| chr1:14498 | ["G","A"] | chr1:14499 |
+---------------+----------------+---------------+
If the region is exclusive of the final position I would expect that it isn’t included in the .end()? But it appears to be. If it still appears as a locus expression then will it be counted in the index of the bed table? I tested to see if it would but something more unexpected happenned:
Using an imported test bed file:
>>> test.show()
+-------------------------+--------+
| interval | target |
+-------------------------+--------+
| interval<locus<GRCh38>> | str |
+-------------------------+--------+
| [chr1:14397-chr1:14398) | "trf" |
| [chr1:14437-chr1:14438) | "trf" |
| [chr1:14464-chr1:14465) | "trf" |
| [chr1:14496-chr1:14500) | "trf" |
+-------------------------+--------+
adapting your code:
mt = mt.annotate_rows(cases_start = test.index(mt.reg.start, all_matches=True))
mt = mt.annotate_rows(cases_end = test.index(mt.reg.end, all_matches=True))
+---------------+----------------+----------------------------+
| locus | alleles | cases_start |
+---------------+----------------+----------------------------+
| locus<GRCh38> | array<str> | array<struct{target: str}> |
+---------------+----------------+----------------------------+
| chr1:14397 | ["CTGT","C"] | [("trf")] |
| chr1:14436 | ["G","A"] | [] |
| chr1:14451 | ["C","T"] | [] |
| chr1:14455 | ["T","TAAGAG"] | [] |
| chr1:14464 | ["A","T"] | [("trf")] |
| chr1:14469 | ["C","T"] | [] |
| chr1:14471 | ["C","T"] | [] |
| chr1:14487 | ["G","A"] | [] |
| chr1:14498 | ["G","A"] | [("trf")] |
+---------------+----------------+----------------------------+
>>>
+---------------+----------------+----------------------------+
| locus | alleles | cases_end |
+---------------+----------------+----------------------------+
| locus<GRCh38> | array<str> | array<struct{target: str}> |
+---------------+----------------+----------------------------+
| chr1:14397 | ["CTGT","C"] | [("trf")] |
| chr1:14436 | ["G","A"] | [] |
| chr1:14451 | ["C","T"] | [] |
| chr1:14455 | ["T","TAAGAG"] | [] |
| chr1:14464 | ["A","T"] | [("trf")] |
| chr1:14469 | ["C","T"] | [] |
| chr1:14471 | ["C","T"] | [] |
| chr1:14487 | ["G","A"] | [] |
| chr1:14498 | ["G","A"] | [("trf")] |
+---------------+----------------+----------------------------+
Both commands give the same annotations. For the first variant:
| chr1:14397 | ["CTGT","C"] | [chr1:14397-chr1:14401) |
I would expect the index command to annotate the region using the start (chr1:14397) but not the end (chr1:14401 ) the region being: [chr1:14397-chr1:14398). Any idea what I’m missing here?
Thanks again,
Dan