Stuck writing a Hail Table

Hi all, I have a MatrixTable containing 300+ trios. I am trying to filter for inherited heterozygous variants with this code:

ped = "ped.fam"
pedigree = hl.Pedigree.read(ped)

td = hl.trio_matrix(mt, pedigree, complete_trios=True)

td_calls = td.entries()
calls = td_calls.filter(
    (td_calls.proband_entry.GT.is_het() & ~td_calls.father_entry.GT.is_hom_ref() & td_calls.mother_entry.GT.is_hom_ref())
    | (td_calls.proband_entry.GT.is_het() & td_calls.father_entry.GT.is_hom_ref() & ~td_calls.mother_entry.GT.is_hom_ref())
    | (td_calls.proband_entry.GT.is_het() & td_calls.father_entry.GT.is_het() & ~td_calls.mother_entry.GT.is_hom_var())
    | (td_calls.proband_entry.GT.is_het() & ~td_calls.father_entry.GT.is_hom_var() & td_calls.mother_entry.GT.is_het())
)

calls = calls.filter(hl.len(calls.filters) == 0)
calls = calls.filter(calls.gnomad_af < 0.001)

calls.write('inherited.ht')

Unfortunately, this stage just lasts forever. For lack of a better way to explain it, it just shows four lines of Hail: INFO: Ordering unsorted dataset with network shuffle without a loading bar, and it has been running for 2 days. I am writing the Hail Table first because I am planning to do more downstream filtering and annotation, which ran into other issues, so I thought it’d be better to make a checkpoint first. I’d appreciate any input to make this process more efficient!

MatrixTable.entries is memory-hungry and slow. Avoid it unless you’ve substantially reduced the amount of data you have.
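To see why: entries() flattens the matrix into one row per (variant, trio) pair before any filtering happens, so the row count multiplies. A rough back-of-the-envelope sketch (the variant count is an illustrative assumption, not a number from this thread):

```python
n_variants = 1_000_000  # illustrative assumption; your dataset may differ
n_trios = 300           # roughly 300 trios, per the original post
flattened_rows = n_variants * n_trios
print(flattened_rows)   # 300,000,000 rows to sort and shuffle
```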

IIUC, you want to filter trios at each variant, not entire variants. What is your ultimate goal? Will you perform some sort of analysis? Do you want a list of trios per variant?

Without knowing more about your end goal, my best recommendation is to use filter_entries:

td = td.filter_entries(
      (td.proband_entry.GT.is_het() & ~td.father_entry.GT.is_hom_ref() &  td.mother_entry.GT.is_hom_ref())
    | (td.proband_entry.GT.is_het() &  td.father_entry.GT.is_hom_ref() & ~td.mother_entry.GT.is_hom_ref())
    | (td.proband_entry.GT.is_het() &  td.father_entry.GT.is_het()     & ~td.mother_entry.GT.is_hom_var())
    | (td.proband_entry.GT.is_het() & ~td.father_entry.GT.is_hom_var() &  td.mother_entry.GT.is_het())
)
td = td.filter_rows(hl.len(td.filters) == 0)
td = td.filter_rows(td.gnomad_af < 0.001)
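In plain Python (no Hail required), those four disjuncts amount to the following check; is_inherited_het is just an illustrative name, and genotypes are encoded as alt-allele counts (0 = hom-ref, 1 = het, 2 = hom-var):

```python
def is_inherited_het(proband: int, father: int, mother: int) -> bool:
    # Mirrors the four disjuncts passed to filter_entries above.
    return proband == 1 and (
        (father != 0 and mother == 0)      # alt allele came from the father
        or (father == 0 and mother != 0)   # alt allele came from the mother
        or (father == 1 and mother != 2)
        or (father != 2 and mother == 1)
    )

print(is_inherited_het(1, 1, 0))  # True: het proband, het father, hom-ref mother
print(is_inherited_het(1, 0, 0))  # False: both parents hom-ref (de novo-like)
```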

Thanks @danking. My goal is to export a list of annotated inherited variants for a subset of genes and trios. With your method above, do you have any advice on how to annotate each row with the trios carrying the inherited variant? It would be very helpful to have that information when I export the rows of the mt.

I have tried working with the Trio matrix (as above), and also with these steps:

td = td.filter_rows(hl.set(genes_of_interest_list).contains(td.gene_symbol))
td = td.filter_entries(hl.set(samples_of_interest_list).contains(td.proband.s))
td.row.export('inherited.tsv')

But I seem to be running into the same issue: four lines of Hail: INFO: Ordering unsorted dataset with network shuffle and no loading bar. Do you know if it is possible to troubleshoot this?

Can you share the whole pipeline in one script block?

Instead of writing to an uncompressed TSV, I recommend saving in Hail’s native format first and then converting.

Let’s also go directly to the data you want:

td = td.annotate_rows(
    kept_trios = hl.agg.collect((td.proband.s, td.mother.s, td.father.s))
)
ht = td.rows().select('kept_trios')
ht.write('.../result.ht')
ht = hl.read_table('.../result.ht')
ht.export('foo.tsv.bgz')

If you prefer one row per family and variant, you can explode:

ht = ht.explode(ht.kept_trios)
ht = ht.rename({'kept_trios': 'trio'})
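As a plain-Python sketch (not the Hail implementation) of what explode does: one row holding a list of trios becomes one row per trio, with every other field duplicated.

```python
def explode(rows, field):
    # One output row per list element; all other fields are copied.
    out = []
    for row in rows:
        for elem in row[field]:
            new_row = dict(row)
            new_row[field] = elem
            out.append(new_row)
    return out

rows = [{'locus': '1:1000',
         'kept_trios': [('kid1', 'mum1', 'dad1'), ('kid2', 'mum2', 'dad2')]}]
for r in explode(rows, 'kept_trios'):
    print(r['locus'], r['kept_trios'])
```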

Thanks a lot for showing me how to annotate it with the trios. Here is the code chunk:

mt = hl.read_matrix_table('vep.mt')  # VCF that has been imported, split, and run through VEP

intervals = "intervals.bed"
exome_intervals = hl.import_locus_intervals(intervals, reference_genome='GRCh37')
mt = mt.filter_rows(hl.is_defined(exome_intervals[mt.locus]), keep=True)

mt = hl.sample_qc(mt, name='sample_qc')
mt = hl.variant_qc(mt, name='variant_qc')

ped = 'ped.fam'
pedigree = hl.Pedigree.read(ped)

td = hl.trio_matrix(mt, pedigree, complete_trios=True)

td = td.filter_entries(
      (td.proband_entry.GT.is_het() & ~td.father_entry.GT.is_hom_ref() &  td.mother_entry.GT.is_hom_ref())
    | (td.proband_entry.GT.is_het() &  td.father_entry.GT.is_hom_ref() & ~td.mother_entry.GT.is_hom_ref())
    | (td.proband_entry.GT.is_het() &  td.father_entry.GT.is_het()     & ~td.mother_entry.GT.is_hom_var())
    | (td.proband_entry.GT.is_het() & ~td.father_entry.GT.is_hom_var() &  td.mother_entry.GT.is_het())
)
td = td.filter_rows(hl.len(td.filters) == 0)
td = td.filter_rows(td.gnomad_af < 0.001)

td = td.filter_rows(hl.set(genes).contains(td.gene_symbol))    #python list of genes
td = td.filter_entries(hl.set(samps).contains(td.proband.s))   #python list of samples

td.write('td.mt', overwrite=True)

This doesn’t really work for me for some reason (memory problem?). Can I get some advice on this? Thank you very much!! If it helps, I am working on a high-performance cluster, and I initialized the memory with hl.init(log='log.log',spark_conf={'spark.driver.memory': '20g', 'spark.executor.memory': '20g'})

What error message do you get? Can you upload the log file to gist.github.com and link it here?

I ran into this error:

Hail version: 0.2.97-937922d7f46c
Error summary: OutOfMemoryError: GC overhead limit exceeded

I will get the log file and message you. Thanks a lot!

How many cores is Spark using? That is the N here:

[Stage 7:===========>                              (14174 + N) / 62500]

You can control that with:

hl.init(master='local[4]')  # use 4 cores

I’m not quite sure what’s going on. Your pipeline shouldn’t blow memory. We assume ~4GiB per core per executor.
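As a rough sanity check (my arithmetic, not a Hail guarantee): 20g of executor memory shared across 4 cores gives about 5 GiB per core, which is above that ~4 GiB assumption.

```python
executor_memory_gib = 20  # from spark.executor.memory = '20g'
cores = 4                 # e.g. master='local[4]'
per_core_gib = executor_memory_gib / cores
print(per_core_gib)       # 5.0
```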

Can you try starting it with this environment variable instead:

export PYSPARK_SUBMIT_ARGS='--executor-memory 20G --driver-memory 20G pyspark-shell'

Thank you for your comments. I believe N was 4 in my scenario as well. I am launching a Jupyter Notebook session to use Hail; do you know how setting the environment variable works there? I am not sure if adding it to .bashrc will work.


You can set it in Python, as long as you do so before you import Hail:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = ...
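Filled in with the value suggested earlier in this thread, that would look like (set it before importing hail):

```python
import os

# Value taken from the export line suggested earlier in this thread.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--executor-memory 20G --driver-memory 20G pyspark-shell'
)
```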