Outputting sample IDs for multi allelic sites

Hello all,

I have been following a very slightly modified version of the solution described in this post:

This has allowed me to query SNVs and determine which samples are het or hom for the SNV of interest. So far, I have not needed to be more specific beyond the interval of interest (e.g. chr18:31592920-31592921). However, I have recently run into a locus of interest that appears to be multi-allelic in our samples. Is there a way I can modify my query to be allele-specific, and then output the sample IDs for those individuals?
Or would I need to first filter on the locus, and then list which samples represent each allele?

Thank you!

Hi @Cecile_Avery,

Does your dataset have split multi-allelics? If not, you’ll need to use split_multi_hts to split multi-allelics into multiple SNVs, which can be queried individually.

For example, if the locus ‘chr18:31592920’ had ref allele “A” and alt alleles “C” and “G”, and you wanted to query the “C” allele specifically, you could use

mt = mt.filter_rows(mt.locus == hl.parse_locus('chr18:31592920'))
mt = hl.split_multi_hts(mt)
mt = mt.filter_rows(mt.alleles == ["A", "C"])

The first line is smart enough to skip directly to that row of the MatrixTable file on disk. Then split into multiple SNVs and filter to the desired one. Samples which had one copy of the “C” alt allele will show as het on the ["A", "C"] row after splitting.

Hope that helps!

That works great, thank you! My current script uses a slightly different filtering mechanism in a loop so that I can query multiple positions at once (the length of the list varies greatly). Is there a performance difference between parse_locus versus filter_intervals or a reason to use one versus the other?

See below for an example:

query_formatted = ['chr12:109596515-109596516', 'chr16:2767261-2767262']

mt_filtered = hl.filter_intervals(
    mt,
    [hl.parse_locus_interval(x,reference_genome='GRCh38')
     for x in query_formatted])

In order to automate this as much as possible, would you recommend that I split multi-allelics before or after filtering? And then filter additionally on alleles? Most queried alleles wouldn’t require this but I figure it would be easier to treat them all the same way.

filter_rows is smart about translating to a filter_intervals when possible. So the following are equivalent (assuming mt is keyed by locus):

query_formatted = ['chr12:109596515-109596516', 'chr16:2767261-2767262']
query_intervals = [hl.parse_locus_interval(x,reference_genome='GRCh38')
     for x in query_formatted])

mt_filtered = hl.filter_intervals(mt, query_intervals)
query_formatted = ['chr12:109596515-109596516', 'chr16:2767261-2767262']
query_intervals = [hl.parse_locus_interval(x,reference_genome='GRCh38')
     for x in query_formatted])

mt_filtered = mt.filter_rows(
    hl.array(query_intervals).any(lambda interval: interval.contains(mt.locus))
query_formatted = ['chr12:109596515', 'chr16:2767261']
query_loci = [hl.parse_locus(x,reference_genome='GRCh38')
     for x in query_formatted])

mt_filtered = mt.filter_rows(hl.set(query_loci).contains(mt.locus))

You can use whichever is clearer (I prefer the last).

I would recommend splitting multi-allelics before filtering, just to be sure it doesn’t waste time splitting loci that will be filtered out. It’s possible we would optimize it correctly to filter first even if you don’t explicitly write it that way (I would have to check), but in this case I think it’s safer to not rely on optimization.

Thank you for the thoughtful response! Filter_rows does seem to be much more flexible, and would actually work better for the files I’m pulling the loci from. I’m revisiting your previous answer about using filter_rows to specify the allele of interest…

mt = mt.filter_rows(mt.alleles == ["A", "C"])

I wonder if I should make a list of lists to be able to iterate over each locus being filtered while also being specific to which alternate allele I’m looking for. Is there any existing method for this sort of operation? Right now I do a manual check of the output but that won’t really scale in the future.

Any advice would be much appreciated.

I’ve stitched together the two operations you described in a while loop… I’m no expert, so I’m sure this is very inefficient. The calling on the resulting mt_allele_specific table (which has only 5 rows) seems to always run into a memory error. Other ways to generate a similar table (without trying to filter on specific alleles) doesn’t slow down as significantly or require much more memory. I’m not really sure why this is, or if there’s a better way to accomplish this?

num_of_variants = len(query_formatted)
counter = 1
while counter < num_of_variants:
    interval = query_formatted[counter]
    print(interval)
    print("I'm checking for the variant...")
    mt_allele_temp=hl.filter_intervals(
        row_agg_mt,
        [hl.parse_locus_interval(interval,)])
    mt_allele_temp = mt_allele_temp.filter_rows(mt_allele_temp.alleles == (alleles_formatted[counter]))
    if mt_allele_temp.count_rows() > 0:
        print("I'm merging the variant...")
        mt_allele_specific = mt_allele_specific.union_rows(mt_allele_temp)
        counter += 1
    else:
        print("I'm skipping this variant...")
        counter += 1
        continue

(Print statements were mostly to assure me that while it was running slowly, it was still making progress. The issue is moreso working with the final table).