Accessing intervals and alleles in matrix table

I’m working on a function to check that after filtering on locus and splitting multi-allelic sites, I am able to return the exact allele I want.

The variant intervals are a list formatted as: ['chr1:1000-1001', 'chr2:1000-1001']
And the alleles are a list of lists: [['A', 'T'], ['C', 'G']]

def filter_matrix_table(mt, variant_intervals_list, alleles_list):
    for i, alleles in enumerate(alleles_list):
        interval_start = variant_intervals_list[i].split("-")[0]

        mt = mt.filter_rows(
            (hl.str(mt.locus) == interval_start) &
            (mt.alleles == alleles)
        )
    return mt

I am working on a small sample and I know that it should not be returning an empty table. Am I approaching this in a way that is logically sound? At a glance it seems like it should work, but I find that accessing values in matrix tables is not intuitive.

Thank you!

Hi @Cecile_Avery,

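This code is building a big AND filter: each pass through the loop filters the table left over from the previous pass, so a row survives only if its (locus, alleles) pair is equal to every interval start and alleles entry in your lists. With more than one distinct variant in the lists, no row can satisfy that, which is why you get an empty table.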
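To make the difference concrete, here is a minimal plain-Python sketch (no Hail involved). The `rows` and `wanted` lists are made-up stand-ins for matrix table rows and the requested variants, so treat it as an illustration of the logic only.

```python
# Hypothetical stand-ins: each "row" is a (locus string, alleles) pair.
rows = [("chr1:1000", ["A", "T"]), ("chr2:1000", ["C", "G"]), ("chr3:5", ["G", "A"])]
wanted = [("chr1:1000", ["A", "T"]), ("chr2:1000", ["C", "G"])]

# Filtering the already-filtered rows on each pass ANDs the conditions together.
and_rows = rows
for target in wanted:
    and_rows = [row for row in and_rows if row == target]

# Keeping a row if it matches any target ORs the conditions together.
or_rows = [row for row in rows if any(row == target for target in wanted)]

print(and_rows)  # []
print(or_rows)   # the chr1 and chr2 rows
```

The first pass keeps only the chr1 row, and the second pass then discards it because it is not the chr2 row, so nothing is left.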

Instead, you want to build a big OR filter that keeps ANY row with a (locus, alleles) pair in your lists, something like:

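import hail as hl

def filter_matrix_table(mt, variant_intervals_list, alleles_list):
    # 'chr1:1000-1001' -> 'chr1:1000', the same form hl.str gives for a locus
    starts = [interval.split('-')[0] for interval in variant_intervals_list]
    return mt.filter_rows(
        hl.any(
            # hl.map takes the function first, then the collections
            hl.map(
                lambda point, alleles: (hl.str(mt.locus) == point) & (mt.alleles == alleles),
                starts, alleles_list
            )
        )
    )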

This is hideously inefficient for long variant and alleles lists but hopefully this helps unblock you for now!
Cheers,
Ed

Ahh, I see. Thank you! I’m going to poke around with your approach, if only to learn more about .any and .map.

Unfortunately, I was writing this up with the end goal of processing a very long list of SNPs.
Do you have any general advice on how to do so? Or would it perhaps be better to get the data out of hail matrix tables post initial filtering and do the allele match with another data structure/format/approach?

In the end, we need to be able to access which samples carry specific alleles.

Hi @Cecile_Avery,

It sounds like you might be interested in some flavour of join; I think semi_join_rows would capture the semantics above (doc link below). You’ll need to format your variant loci and alleles lists as a table, keyed by (locus, alleles) (i.e. the row key of your matrix table). This should allow you to scale to long lists.

https://www.hail.is/docs/0.2/hail.MatrixTable.html#hail.MatrixTable.semi_join_rows
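
A rough sketch of what that could look like, assuming the same list formats as in the question. The helper name `variant_records` and the intermediate `contig` / `position` fields are made up for illustration, and the reference genome is read from the matrix table's own locus type so that the key types match.

```python
def variant_records(variant_intervals_list, alleles_list):
    """Turn 'chr1:1000-1001' + ['A', 'T'] pairs into plain Python records."""
    records = []
    for interval, alleles in zip(variant_intervals_list, alleles_list):
        start = interval.split('-')[0]            # 'chr1:1000'
        contig, position = start.split(':')
        records.append(
            {'contig': contig, 'position': int(position), 'alleles': list(alleles)}
        )
    return records


def filter_matrix_table(mt, variant_intervals_list, alleles_list):
    # Imported here only so variant_records can be used without Hail installed.
    import hail as hl

    ht = hl.Table.parallelize(
        variant_records(variant_intervals_list, alleles_list),
        hl.tstruct(contig=hl.tstr, position=hl.tint32, alleles=hl.tarray(hl.tstr)),
    )
    # Build a locus in the same reference genome as the matrix table.
    ht = ht.annotate(
        locus=hl.locus(
            ht.contig, ht.position,
            reference_genome=mt.locus.dtype.reference_genome,
        )
    )
    # Key by (locus, alleles) to match the matrix table's row key.
    ht = ht.key_by('locus', 'alleles').select()
    # Keep only the rows whose key appears in the variant table.
    return mt.semi_join_rows(ht)
```

If the alleles in your list are biallelic, this would be run after splitting multi-allelic sites so that mt.alleles has the same two-element form. To then see which samples carry each allele, something like mt.annotate_rows(carriers=hl.agg.filter(mt.GT.is_non_ref(), hl.agg.collect(mt.s))) may work, assuming your entries have a GT field and your columns are keyed by s.
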

Hope this helps!
Cheers,
Ed