Parse_locus() over multiple regions using GRCh38

Hi,

my data is on GRCh38 and I want to filter a set of SNPs from this. Im am looking for something like this:

intervals = [hl.parse_locus(x, reference_genome='GRCh38') for x in ['4:144585304', '15:78575140']]
mt_hits = hl.filter_intervals(mt, intervals, keep = True)

which is giving the error: TypeError: filter_intervals: parameter ‘intervals’: expected expression of type array<interval>, found list: [<LocusExpression of type locus>, <LocusExpression of type locus>]

In 0.2.15 or later, doing the following:

intervals = [hl.parse_locus(x, reference_genome='GRCh38') for x in ['4:144585304', '15:78575140']]
mt_hits = mt.filter_rows(hl.literal(intervals).contains(mt.locus))

will have the same performance as filter_intervals, but is much easier to use. See here:

Thank you, I am now using HAIL version 0.2.15-652d93ae3419, but get this error:
Error summary: HailException: Invalid locus ‘4:144585304’ found. Contig ‘4’ is not in the reference genome ‘GRCh38’

I have also tried other loci of which I am also sure the should be in GRCh38, but all give the same error.

chromosome 4 is identified as chr4 in GRCh38

Works! Thanks :smile:

@tpoterba I’m doing a very similar thing here where I have a list of locuses I’d like to filter by in the format of ‘1:10556097’ etc. I tried appending a ‘chr’ to each value in the list convert it into what I think would work with GRch38 and that resulted is some more errors with certain locations being out of range and such.

Is there a way to convert a list of locuses to a format that would work with filter_rows?

Here is my current code:

#RSIDloc is a list of string values for different locuses
intervals = [hl.parse_locus(x, reference_genome=‘GRCh38’) for x in RSIDloc]
mt = mt.filter_rows(hl.literal(intervals).contains(mt.locus))
mt.show()

This gives the same error:

Error summary: HailException: Invalid locus ‘1:10556097’ found. Contig ‘1’ is not in the reference genome ‘GRCh38’

prepending chr is the right thing to do. If you’re getting errors about positions being out of those range, those are important errors – if your position is larger than the size of a chromosome in GRCh38, there’s no way to generate a sensible locus for that string.

I’m pulling these locus values from values I pulled off the ‘dbSNP_rsid’ schema/dataset that I used to search rsID values from. I guess I can go through and drop these values manually. Is there a better way to get locus values specifically for the GRCh38 instead of going through GRCh37?

I’m not sure I understand what you’re trying to do. It sounds like your loci are actually GRCh37 loci. Why are you trying to parse them as GRCh38 loci? Converting from GRCh37 variants to GRCh38 variants is a nontrivial process. We have some tools to help (look at hl.liftover) but you also have to decide what to do about the alleles, if those matter to you.

I was having a bit of a trouble with the hl.liftover function, but I found a way that worked for me. I just used the ‘dbSNP_rsid’ data set with GRCh38.

rsid = hl.experimental.load_dataset(name=‘dbSNP_rsid’,version=‘154’,reference_genome=‘GRCh38’)

I then used this hail table to get all the loci I needed from the list of rsid values I had.

I don’t know why I didn’t do this before.

1 Like