Annotating deletion overlapping with bed

Hi,

I have a bed file I am attempting to annotate my mt with.

mt=mt.annotate_rows(trf=trf[mt.locus].target)

Which works as expected, however I have deletions that start outside the regions but overlap it that I would like to annotate as well as technically they are within those regions.

Can anyone think how the best way to proceed with this in hail?

So far I have tried:

mt=mt.annotate_rows(region=hl.parse_locus_interval(mt.locus.contig+":"+hl.str(mt.locus.position)+"-"+hl.str(mt.locus.position+hl.max(0,hl.len(mt.alleles[0])-hl.len(mt.alleles[1])))))

which will annotate the mt with a region that is :
position - position + max(0,len(allele1)-len(allele2))

i.e. for in del allele1 > allele2 so will be size of indel
for ins allele2> allele1 but max(0,…) will make it 0
for SNV allele1==allele2 so len 0

I just realized I need something for mnps… I’ll have to think that through.

The annotation makes:

| locus         | alleles        | region                        |
+---------------+----------------+-------------------------------+
| locus<GRCh38> | array<str>     | interval<locus<GRCh38>>       |
+---------------+----------------+-------------------------------+
| chr2:47403162 | ["A","C"]      | (chr2:47403161-chr2:47403162) |
| chr2:47408388 | ["AACG","A"]   | [chr2:47408388-chr2:47408491) |

Is there a step that can then annotate with a bed using the overlap with the interval in the mt, instead of the locus?

Alternatively, is there a better way of doing this in hail that I am missing?

Also, I notice that the parse_locus_interval command gives position-1-x for an SNV/single base but position-x for a deletion/region is that expected?

Thanks a lot

Dan

Hi,

After a little bit more playing around:

>>> mt.region.show()
+---------------+----------------+-------------------------+
| locus         | alleles        | region                  |
+---------------+----------------+-------------------------+
| locus<GRCh38> | array<str>     | interval<locus<GRCh38>> |
+---------------+----------------+-------------------------+
| chr1:14397    | ["CTGT","C"]   | [chr1:14397-chr1:14400) |
| chr1:14436    | ["G","A"]      | (chr1:14435-chr1:14436) |
| chr1:14451    | ["C","T"]      | (chr1:14450-chr1:14451) |
| chr1:14455    | ["T","TAAGAG"] | (chr1:14454-chr1:14455) |


>>> trf.show()
[Stage 0:>                                                          (0 + 1) / 1]+---------------------------+--------+
| interval                  | target |
+---------------------------+--------+
| interval<locus<GRCh38>>   | str    |
+---------------------------+--------+
| [chr1:10001-chr1:10469)   | "trf"  |
| [chr1:30863-chr1:30960)   | "trf"  |
| [chr1:44836-chr1:44877)   | "trf"  |
| [chr1:50482-chr1:50514)   | "trf"  |
| [chr1:54713-chr1:54818)   | "trf"  |

I think the interval.overlaps() method is what I want but it won’t work.

If for example I run it on the trf table on itself:

>>> trf.interval.overlaps(trf.interval).show()
[Stage 2:>                                                          (0 + 1) / 1]2021-03-31 09:53:56 Hail: INFO: Ordering unsorted dataset with network shuffle
[Stage 6:>                                                          (0 + 1) / 1]+---------------------------+--------+
| interval                  | <expr> |
+---------------------------+--------+
| interval<locus<GRCh38>>   |   bool |
+---------------------------+--------+
| [chr1:10001-chr1:10469)   |   true |
| [chr1:30863-chr1:30960)   |   true |
| [chr1:44836-chr1:44877)   |   true |
| [chr1:50482-chr1:50514)   |   true |

But getting an overlap between the mt.region column and trf.interval does not work:

>>> mt.region.overlaps(trf.interval).show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<decorator-gen-588>", line 2, in overlaps
  File "/data/rds-cache/DGE/DUDGE/MOPOPGEN/soft/conda/hail/lib/python3.6/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/data/rds-cache/DGE/DUDGE/MOPOPGEN/soft/conda/hail/lib/python3.6/site-packages/hail/expr/expressions/typed_expressions.py", line 3601, in overlaps
    return self._method("overlaps", tbool, interval)
  File "/data/rds-cache/DGE/DUDGE/MOPOPGEN/soft/conda/hail/lib/python3.6/site-packages/hail/expr/expressions/base_expression.py", line 595, in _method
    indices, aggregations = unify_all(self, *args)
  File "/data/rds-cache/DGE/DUDGE/MOPOPGEN/soft/conda/hail/lib/python3.6/site-packages/hail/expr/expressions/base_expression.py", line 353, in unify_all
    )) from None
hail.expr.expressions.base_expression.ExpressionException: Cannot combine expressions from different source objects.
    Found fields from 2 objects:
        <hail.matrixtable.MatrixTable object at 0x7effb9f60f60>: ['region']
        <hail.table.Table object at 0x7effba0bf940>: ['interval']

Any ideas? An alternative would be to iterate through each row of the mt ‘region’ and run an overlaps command on that, would that be possible?

Thanks

Dan

Can you give an example of the surprising behavior for hl.parse_locus_interval? You can check specific things like this:

hl.parse_locus_interval('chr1:123-124').show()

Yeah this is a bit tricky. You want an interval-interval join. I usually think of four kinds interval pairs:

1.
+---------+
   +---+

2.
     +---+
   +---+

3.
 +---+
   +---+

4.
 +---+
       +---+

You’ll want something like:

mt = mt.annotate(cases_1_and_3 = trf.index(mt.region.start, all_matches=True)])
mt = mt.annotate(cases_1_and_2 = trf.index(mt.region.end, all_matches=True)])
mt = mt.annotate(trfs = hl.set(case_1_and_3).union(hl.set(cases_1_and_2))]))

The first operation will be faster if you use mt.locus instead of mt.region.start. The second operation is sort of necessarily expensive, unfortunately.

Hi Dan, thanks for that - I’ll mentally unpack that and give it a go and get back to you!

Although it’s less likely how would you cover the case

      +---------+
+--------------------+

i.e. a case where the indel doesn’t start or end inside the region but spans it? So the reverse of case 1. ?

Thanks!

Dan

Ah, I thought I was missing a case! Good point. Let’s try that again:

mt = mt.annotate(cases_1_and_3 = trf.index(mt.region.start, all_matches=True)])
mt = mt.annotate(cases_1_and_2 = trf.index(mt.region.end, all_matches=True)])

# Hail, annoyingly, removes the key after a join, so we copy it first
xx = mt.annotate(locus_copy = mt.locus)
xx = xx.key_by(xx.region)
yy = trf.annotate(
    contained_mt_loci = xx.index(trf.interval.start, all_matches=True).locus_copy
)
yy = yy.explode(contained_mt_loci)
yy = yy.rename({'contained_mt_loci': 'contained_mt_locus'})
yy = yy.key_by(contained_mt_locus)
yy = yy.drop(yy.interval)

mt = mt.annotate(case5 = yy[mt.locus])
mt = mt.annotate(
    trfs = hl.set(case_1_and_3).union(hl.set(cases_1_and_2)).add(case5)])
)

Thanks! I’ll try it out :slight_smile: I think i need to do some reading about indexing in hail. I didn’t realize you could access a locus region index using a locus (that’s what I think is happening here).

For the parse_locus interval behaviour - I was just wondering about the way it represents the regions. e.g.

>>> hl.parse_locus_interval('chr1:123-124').show()
+-------------------------+
| <expr>                  |
+-------------------------+
| interval<locus<GRCh38>> |
+-------------------------+
| [chr1:123-chr1:124)     |
+-------------------------+

>>> hl.parse_locus_interval('chr1:123-123').show()
+-------------------------+
| <expr>                  |
+-------------------------+
| interval<locus<GRCh38>> |
+-------------------------+
| (chr1:122-chr1:123)     |
+-------------------------+

Are the regions 0 based, 1 based? 0 half open (bed?)

So if a region is a single base it looks like the region that is produced becomes 0 based? in this case 122-123 when the region was 123-123

when I put in 123-124 it remains the same.

Ah, yes, that’s a pervasive feature in Hail. If you have one table keyed by any type T and second table keyed by interval<T>, you can bring data from the second table onto the first table. It works for intervals of integers, intervals of strings, intervals of intervals of strings, etc.

Hail’s intervals are agnostic to properties of their “point type”, in your case: Locus. Hail’s loci are all 1-based because, as far as I know, that’s what GRCh37 and GRCh38 do.

Hail’s intervals can be open, half-open, or closed. This is indicated visually with ( for open and [ for closed. The first interval contains the point chr1:123 but excludes the point chr1:124. The second interval contains no points because it excludes both of its endpoints and there are no intervening points.

hl.parse_locus_interval defaults to left-inclusive, right-exclusive. That’s why the first interval is also left-inclusive and right-exclusive.

It would be, perhaps, less confusing if, for your second example, Hail produced the interval [chr1:123, chr1:123). I think we elect not to do that because it’s ambiguous: the left-endpoint is inclusive (so chr1:123 should be present) but the right-endpoint is exclusive (so chr1:123 should not be present).

All that said, I think Hail should produce (chr:123, chr:123) for your second example because it is less surprising.

Hi Dan, thanks again.

I’ve put a few tests together to try and make sure I get what’s happening here (apologies for the wall of text):

I have made a few changes after your explanation:

 mt=mt.annotate_rows(reg=hl.parse_locus_interval("["+mt.locus.contig+":"+hl.str(mt.locus.position)+"-"+hl.str(mt.locus.position+hl.len(mt.alleles[0]))+")"))

So, I annotate a region (reg) as the length of the reference allele only now (I realized you don’t really need to consider anything else as that determines how much of the reference would overlap with the bed). I use ‘[’ to start so it is left inclusive and ‘)’ so it is right exclusive. Exclusive because position+length will be 1 greater so it should be the same as pos+len-1+’]’ right?

Anyway, the table is:

>>> mt.reg.show()
+---------------+----------------+-------------------------+
| locus         | alleles        | reg                     |
+---------------+----------------+-------------------------+
| locus<GRCh38> | array<str>     | interval<locus<GRCh38>> |
+---------------+----------------+-------------------------+
| chr1:14397    | ["CTGT","C"]   | [chr1:14397-chr1:14401) |
| chr1:14436    | ["G","A"]      | [chr1:14436-chr1:14437) |
| chr1:14451    | ["C","T"]      | [chr1:14451-chr1:14452) |
| chr1:14455    | ["T","TAAGAG"] | [chr1:14455-chr1:14456) |
| chr1:14464    | ["A","T"]      | [chr1:14464-chr1:14465) |
| chr1:14469    | ["C","T"]      | [chr1:14469-chr1:14470) |
| chr1:14471    | ["C","T"]      | [chr1:14471-chr1:14472) |
| chr1:14487    | ["G","A"]      | [chr1:14487-chr1:14488) |
| chr1:14498    | ["G","A"]      | [chr1:14498-chr1:14499) |
+---------------+----------------+-------------------------+

So 1 base intervals e.g. chr1:14436 G/A are (properly?) represented as [chr1:14451-chr1:14452)

I’m a bit surprised by what reg.end.show() gives though:

>>> mt.reg.start.show()
+---------------+----------------+---------------+
| locus         | alleles        | <expr>        |
+---------------+----------------+---------------+
| locus<GRCh38> | array<str>     | locus<GRCh38> |
+---------------+----------------+---------------+
| chr1:14397    | ["CTGT","C"]   | chr1:14397    |
| chr1:14436    | ["G","A"]      | chr1:14436    |
| chr1:14451    | ["C","T"]      | chr1:14451    |
| chr1:14455    | ["T","TAAGAG"] | chr1:14455    |
| chr1:14464    | ["A","T"]      | chr1:14464    |
| chr1:14469    | ["C","T"]      | chr1:14469    |
| chr1:14471    | ["C","T"]      | chr1:14471    |
| chr1:14487    | ["G","A"]      | chr1:14487    |
| chr1:14498    | ["G","A"]      | chr1:14498    |
+---------------+----------------+---------------+

>>> mt.reg.end.show()
+---------------+----------------+---------------+
| locus         | alleles        | <expr>        |
+---------------+----------------+---------------+
| locus<GRCh38> | array<str>     | locus<GRCh38> |
+---------------+----------------+---------------+
| chr1:14397    | ["CTGT","C"]   | chr1:14401    |
| chr1:14436    | ["G","A"]      | chr1:14437    |
| chr1:14451    | ["C","T"]      | chr1:14452    |
| chr1:14455    | ["T","TAAGAG"] | chr1:14456    |
| chr1:14464    | ["A","T"]      | chr1:14465    |
| chr1:14469    | ["C","T"]      | chr1:14470    |
| chr1:14471    | ["C","T"]      | chr1:14472    |
| chr1:14487    | ["G","A"]      | chr1:14488    |
| chr1:14498    | ["G","A"]      | chr1:14499    |
+---------------+----------------+---------------+

If the region is exclusive of the final position I would expect that it isn’t included in the .end()? But it appears to be. If it still appears as a locus expression then will it be counted in the index of the bed table? I tested to see if it would but something more unexpected happenned:

Using an imported test bed file:

>>> test.show()
+-------------------------+--------+
| interval                | target |
+-------------------------+--------+
| interval<locus<GRCh38>> | str    |
+-------------------------+--------+
| [chr1:14397-chr1:14398) | "trf"  |
| [chr1:14437-chr1:14438) | "trf"  |
| [chr1:14464-chr1:14465) | "trf"  |
| [chr1:14496-chr1:14500) | "trf"  |
+-------------------------+--------+

adapting your code:

mt = mt.annotate_rows(cases_start = test.index(mt.reg.start, all_matches=True))
mt = mt.annotate_rows(cases_end = test.index(mt.reg.end, all_matches=True))

+---------------+----------------+----------------------------+
| locus         | alleles        | cases_start                |
+---------------+----------------+----------------------------+
| locus<GRCh38> | array<str>     | array<struct{target: str}> |
+---------------+----------------+----------------------------+
| chr1:14397    | ["CTGT","C"]   | [("trf")]                  |
| chr1:14436    | ["G","A"]      | []                         |
| chr1:14451    | ["C","T"]      | []                         |
| chr1:14455    | ["T","TAAGAG"] | []                         |
| chr1:14464    | ["A","T"]      | [("trf")]                  |
| chr1:14469    | ["C","T"]      | []                         |
| chr1:14471    | ["C","T"]      | []                         |
| chr1:14487    | ["G","A"]      | []                         |
| chr1:14498    | ["G","A"]      | [("trf")]                  |
+---------------+----------------+----------------------------+

>>>

+---------------+----------------+----------------------------+
| locus         | alleles        | cases_end                  |
+---------------+----------------+----------------------------+
| locus<GRCh38> | array<str>     | array<struct{target: str}> |
+---------------+----------------+----------------------------+
| chr1:14397    | ["CTGT","C"]   | [("trf")]                  |
| chr1:14436    | ["G","A"]      | []                         |
| chr1:14451    | ["C","T"]      | []                         |
| chr1:14455    | ["T","TAAGAG"] | []                         |
| chr1:14464    | ["A","T"]      | [("trf")]                  |
| chr1:14469    | ["C","T"]      | []                         |
| chr1:14471    | ["C","T"]      | []                         |
| chr1:14487    | ["G","A"]      | []                         |
| chr1:14498    | ["G","A"]      | [("trf")]                  |
+---------------+----------------+----------------------------+

Both commands give the same annotations. For the first variant:

| chr1:14397 | ["CTGT","C"] | [chr1:14397-chr1:14401) |

I would expect the index command to annotate the region using the start (chr1:14397) but not the end (chr1:14401 ) the region being: [chr1:14397-chr1:14398). Any idea what I’m missing here?

Thanks again,

Dan

Ah, is it because the command

mt = mt.annotate_rows(cases_end = test.index(mt.reg.end, all_matches=True))

is using the mt key (locus) and not mt.reg.end?

Yes, for integral values (like positions), [x, y) is equivalent to [x, y-1]. This isn’t true for floating-point / Real-number values.

You can also directly use hl.interval (this is faster because it avoids string construction and parsing):

hl.interval(mt.locus, 
            hl.locus(mt.locus.contig, 
                     mt.locus.position + hl.len(mt.alleles[0])),
            includes_end=False)

Yes, they’re represented that way. I don’t think there’s any “proper” representation, there are three Hail Intervals that contain the same (singleton) set of loci:

  • (chr1:14450-chr1:14451]
  • [chr1:14451-chr1:14451]
  • [chr1:14451-chr1:14452)

End refers to the end of the interval not the greatest value contained in the interval. For integers, we can always produce the greatest value contained in the interval. However, this is not true in general [1].

I think the central issue you’re highlighting is that Hail’s intervals are designed as general purpose intervals, but most of our users care about genomic intervals. Perhaps we should have a special datatype that works the way you expect. For the time being, you can implement “greatest genomic locus in the interval” yourself as follows.

def greatest_locus_in(interval):
    # ASSUMPTION: the interval spans only one chromosome
    locus_before_end = hl.locus(interval.end.contig, interval.end.position - 1)
    return (hl.case()
        # if the end is included, return it
        .when(interval.contains(interval.end),
              interval.end)
        # if the locus before the end exists (e.g. end is not chr1:1), and
        # the locus before the end is in the interval (e.g. not (chr1:10, chr1:10)), then
        # return the locus before the end
        .when((interval.end.position != 1) & interval.contains(locus_before_end),
              locus_before_end)
        # otherwise the interval is empty, so return missing
        .or_missing())

The index is using mt.reg.end. This looks like a bug to me, can you provide a small file and small, complete script that reproduces it?


[1] Consider, for example, the real-number interval [0, 1). There is no greatest value, 0.99 is greater than 0.9, 0.999 is greater than 0.99, etc. The “greatest floating-point value in an interval” exists, but that’s misleading because we often use floating-point numbers as an approximation of the real numbers.