Can I lift over a Hail Table from GRCh37 to GRCh38?

#1

Dear Hail team:

Is it possible to lift over a table from GRCh37 to GRCh38? I see there is a function to lift over a single locus, but can I lift over an entire Hail Table? Thanks!

Best regards,
Wei

#2

Konrad is planning to add a bit about this in the how-to, but here’s his example:

Lift over the locus coordinates in a Table or MatrixTable from reference
genome 'GRCh37' to 'GRCh38':


    >>> rg37 = hl.get_reference('GRCh37')  # doctest: +SKIP
    >>> rg38 = hl.get_reference('GRCh38')  # doctest: +SKIP
    >>> rg37.add_liftover('gs://hail-common/references/grch37_to_grch38.over.chain.gz', rg38)  # doctest: +SKIP
    >>> ht = ht.annotate(new_locus=hl.liftover(ht.locus, 'GRCh38'), old_locus=ht.locus)  # doctest: +SKIP
    >>> ht = ht.filter(hl.is_defined(ht.new_locus))  # doctest: +SKIP
    >>> ht = ht.key_by(locus=ht.new_locus)  # doctest: +SKIP
#3

Hi Tim,

Thank you very much! That solves the problem.

Best regards,
Wei

#4

Hi Tim,

Can I follow up with a related question? I successfully lifted over a Hail Table using your code. Now I want to annotate a MatrixTable from this Table with the following command:

mt = mt.annotate_rows(gnomAD = gn[mt.locus, mt.alleles])

But I got the error message below. My suspicion is that some variants in the MatrixTable are not included in the Table, which causes an error when calling “gn[mt.locus, mt.alleles]”. Could you please let me know how to fix this issue?

Hail version: 0.2.8-70304a52d33d
Error summary: SparkException: Job aborted due to stage failure: Task 340 in stage 49.0 failed 4 times, most recent failure: Lost task 340.3 in stage 49.0 (TID 174347, ip-10-66-50-98.goldfinch.lan, executor 62): ExecutorLostFailure (executor 62 exited caused by one of the running tasks) Reason: Container marked as failed: container_1547607392313_0230_01_173099 on host: ip-10-66-50-98.goldfinch.lan. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
Killed by external signal

Thank you very much!
Wei

#5

This error (exit code 137) usually means a Spark executor ran out of memory and its container was killed. The code you’re running looks fine.
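For anyone wondering where the number comes from: by the usual Unix convention, a process killed by signal N is reported with exit code 128 + N, so 137 corresponds to signal 9 (SIGKILL). Here is a minimal sketch of that arithmetic. The helper name `signal_from_exit_code` is just something I made up for illustration; it is not part of Hail or Spark.

```python
import signal


def signal_from_exit_code(code):
    """Return the signal number implied by a shell-style exit code, or None.

    Hypothetical helper for illustration only.
    """
    # Convention: a process killed by signal N exits with status 128 + N.
    if code > 128:
        return code - 128
    return None


sig = signal_from_exit_code(137)
# On Linux, signal.Signals(9) is SIGKILL, the signal commonly sent
# when a container exceeds its memory limit.
print(sig, signal.Signals(sig).name)
```

A SIGKILL by itself does not prove an out-of-memory condition, but on a YARN cluster that is the most common reason a container is killed this way.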

What kind of cluster are you running on?

#6

Thanks for your reply!
I’m using a 24-node cluster on AWS, with 192 vCores and 576 GB of memory.

#7

What is the full pipeline you are running?

#8

    import hail as hl

    # Load data
    mt = hl.read_matrix_table('all.SNVs.vqsr.mt')

    # gnomAD annotations
    gn = hl.read_table('gnomad.genomes.r2.1.sites.ht')

    # Register the GRCh37 -> GRCh38 chain file, then lift over each locus
    rg37 = hl.get_reference('GRCh37')
    rg38 = hl.get_reference('GRCh38')
    rg37.add_liftover('grch37_to_grch38.over.chain.gz', rg38)
    gn = gn.annotate(new_locus=hl.liftover(gn.locus, 'GRCh38'), old_locus=gn.locus)

    # Drop loci that failed to lift over, then re-key by the new locus and alleles
    gn = gn.filter(hl.is_defined(gn.new_locus))
    gn = gn.key_by(locus=gn.new_locus, alleles=gn.alleles)

    # Annotate the MatrixTable rows with the lifted-over gnomAD fields
    mt = mt.annotate_rows(gnomAD=gn[mt.locus, mt.alleles])
    mt.rows().show()

#9

How many partitions in the MT and the table? Generally, you can fix memory issues by increasing the number of partitions. I’d try doubling the number of partitions on both.
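A rough sketch of what I mean by doubling, assuming `mt` and `gn` are the MatrixTable and Table from your pipeline. Both types expose `n_partitions()` and `repartition()`, so one small helper covers either; the helper name `double_partitions` is hypothetical, not a Hail function.

```python
def double_partitions(dataset):
    """Repartition a Hail Table or MatrixTable to twice its current partition count.

    Hypothetical helper: it only relies on the n_partitions() and
    repartition() methods that both Table and MatrixTable provide.
    """
    current = dataset.n_partitions()
    # repartition() shuffles the data by default, so this is not free.
    return dataset.repartition(2 * current)


# Assumed usage with the objects from the pipeline above:
# mt = double_partitions(mt)
# gn = double_partitions(gn)
```

Because the repartition itself triggers a shuffle, it is probably worth checking `n_partitions()` before and after to confirm the change took effect.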

#10

I have 67376 partitions for the MT and 10000 partitions for the table. Thanks for the suggestion. I will try to increase these numbers.

split this topic #11

A post was split to a new topic: Liftover NullPointerException