Stuck writing a Hail Table

Hi all, I have a MatrixTable containing 300+ trios. I am trying to filter for inherited heterozygous variants with this code:

ped = "ped.fam"
pedigree = hl.Pedigree.read(ped)

td = hl.trio_matrix(mt, pedigree, complete_trios=True)

td_calls = td.entries()
calls = td_calls.filter(
    (td_calls.proband_entry.GT.is_het() & ~td_calls.father_entry.GT.is_hom_ref() & td_calls.mother_entry.GT.is_hom_ref())
    | (td_calls.proband_entry.GT.is_het() & td_calls.father_entry.GT.is_hom_ref() & ~td_calls.mother_entry.GT.is_hom_ref())
    | (td_calls.proband_entry.GT.is_het() & td_calls.father_entry.GT.is_het() & ~td_calls.mother_entry.GT.is_hom_var())
    | (td_calls.proband_entry.GT.is_het() & ~td_calls.father_entry.GT.is_hom_var() & td_calls.mother_entry.GT.is_het())
)

calls = calls.filter(hl.len(calls.filters) == 0)
calls = calls.filter(calls.gnomad_af < 0.001)

calls.write('inherited.ht')

Unfortunately, this stage just lasts forever. For lack of a better way to explain it, it just shows four lines of Hail: INFO: Ordering unsorted dataset with network shuffle without a loading bar, and it has been running for 2 days. I am writing the Hail Table first because I am planning to do more downstream filtering and annotation, which ran into other issues, so I thought it’d be better to make a checkpoint first. I’d appreciate any input to make this process more efficient!

MatrixTable.entries is memory-hungry and slow. Avoid it unless you’ve substantially reduced the amount of data you have.
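To see why: entries() flattens the matrix into one row per (variant, trio) pair before any filtering happens, so the row count multiplies. A rough back-of-the-envelope sketch (the variant count is an illustrative assumption, not a number from this thread):

```python
n_variants = 1_000_000  # illustrative assumption; your dataset may differ
n_trios = 300           # roughly 300 trios, per the original post
flattened_rows = n_variants * n_trios
print(flattened_rows)   # 300,000,000 rows to sort and shuffle
```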

IIUC, you want to filter trios at each variant, not entire variants. What is your ultimate goal? Will you perform some sort of analysis? Do you want a list of trios per variant?

Without knowing more about your end goal, my best recommendation is to use filter_entries:

td = td.filter_entries(
      (td.proband_entry.GT.is_het() & ~td.father_entry.GT.is_hom_ref() &  td.mother_entry.GT.is_hom_ref())
    | (td.proband_entry.GT.is_het() &  td.father_entry.GT.is_hom_ref() & ~td.mother_entry.GT.is_hom_ref())
    | (td.proband_entry.GT.is_het() &  td.father_entry.GT.is_het()     & ~td.mother_entry.GT.is_hom_var())
    | (td.proband_entry.GT.is_het() & ~td.father_entry.GT.is_hom_var() &  td.mother_entry.GT.is_het())
)
td = td.filter_rows(hl.len(td.filters) == 0)
td = td.filter_rows(td.gnomad_af < 0.001)
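In plain Python (no Hail required), those four disjuncts amount to the following check; is_inherited_het is just an illustrative name, and genotypes are encoded as alt-allele counts (0 = hom-ref, 1 = het, 2 = hom-var):

```python
def is_inherited_het(proband: int, father: int, mother: int) -> bool:
    # Mirrors the four disjuncts passed to filter_entries above.
    return proband == 1 and (
        (father != 0 and mother == 0)      # alt allele came from the father
        or (father == 0 and mother != 0)   # alt allele came from the mother
        or (father == 1 and mother != 2)
        or (father != 2 and mother == 1)
    )

print(is_inherited_het(1, 1, 0))  # True: het proband, het father, hom-ref mother
print(is_inherited_het(1, 0, 0))  # False: both parents hom-ref (de novo-like)
```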

Thanks @danking. My goal is to export a list of annotated inherited variants for a subset of genes and trios. With your method above, do you have any advice on how to annotate each row with the trios carrying the inherited variant? It would be very helpful to have that information when I export the rows of the mt.

I have tried working with the Trio matrix (as above), and also with these steps:

td = td.filter_rows(hl.set(genes_of_interest_list).contains(td.gene_symbol))
td = td.filter_entries(hl.set(samples_of_interest_list).contains(td.proband.s))
td.row.export('inherited.tsv')

But I seem to be running into the same issue: four lines of Hail: INFO: Ordering unsorted dataset with network shuffle and no loading bar. Do you know if it is possible to troubleshoot this?

Can you share the whole pipeline in one script block?

Instead of writing to an uncompressed TSV, I recommend saving in Hail’s native format first and then converting.

Let’s also go directly to the data you want:

td = td.annotate_rows(
    kept_trios = hl.agg.collect((td.proband.s, td.mother.s, td.father.s))
)
ht = td.rows().select('kept_trios')
ht.write('.../result.ht')
ht = hl.read_table('.../result.ht')
ht.export('foo.tsv.bgz')

If you prefer one row per family and variant, you can explode:

ht = ht.explode(ht.kept_trios)
ht = ht.rename({'kept_trios': 'trio'})
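As a plain-Python sketch (not the Hail implementation) of what explode does: one row holding a list of trios becomes one row per trio, with every other field duplicated.

```python
def explode(rows, field):
    # One output row per list element; all other fields are copied.
    out = []
    for row in rows:
        for elem in row[field]:
            new_row = dict(row)
            new_row[field] = elem
            out.append(new_row)
    return out

rows = [{'locus': '1:1000',
         'kept_trios': [('kid1', 'mum1', 'dad1'), ('kid2', 'mum2', 'dad2')]}]
for r in explode(rows, 'kept_trios'):
    print(r['locus'], r['kept_trios'])
```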

Thanks a lot for showing me how to annotate it with the trios. Here is the code chunk:

mt = hl.read_matrix_table('vep.mt')  # VCF that has been imported, split, and run through VEP

intervals = "intervals.bed"
exome_intervals = hl.import_locus_intervals(intervals, reference_genome='GRCh37')
mt = mt.filter_rows(hl.is_defined(exome_intervals[mt.locus]), keep=True)

mt = hl.sample_qc(mt, name='sample_qc')
mt = hl.variant_qc(mt, name='variant_qc')

ped = 'ped.fam'
pedigree = hl.Pedigree.read(ped)

td = hl.trio_matrix(mt, pedigree, complete_trios=True)

td = td.filter_entries(
      (td.proband_entry.GT.is_het() & ~td.father_entry.GT.is_hom_ref() &  td.mother_entry.GT.is_hom_ref())
    | (td.proband_entry.GT.is_het() &  td.father_entry.GT.is_hom_ref() & ~td.mother_entry.GT.is_hom_ref())
    | (td.proband_entry.GT.is_het() &  td.father_entry.GT.is_het()     & ~td.mother_entry.GT.is_hom_var())
    | (td.proband_entry.GT.is_het() & ~td.father_entry.GT.is_hom_var() &  td.mother_entry.GT.is_het())
)
td = td.filter_rows(hl.len(td.filters) == 0)
td = td.filter_rows(td.gnomad_af < 0.001)

td = td.filter_rows(hl.set(genes).contains(td.gene_symbol))    #python list of genes
td = td.filter_entries(hl.set(samps).contains(td.proband.s))   #python list of samples

td.write('td.mt', overwrite=True)

This doesn’t really work for me for some reason (memory problem?). Can I get some advice on this? Thank you very much!! If it helps, I am working on a high-performance cluster, and I initialized the memory with hl.init(log='log.log',spark_conf={'spark.driver.memory': '20g', 'spark.executor.memory': '20g'})

What error message do you get? Can you upload the log file to gist.github.com and link it here?

I ran into this error:

Hail version: 0.2.97-937922d7f46c
Error summary: OutOfMemoryError: GC overhead limit exceeded

I will get the log file and message you. Thanks a lot!

How many cores is Spark using? That is the N here:

[Stage 7:===========>                              (14174 + N) / 62500]

You can control that with:

hl.init(master='local[4]')  # use 4 cores

I’m not quite sure what’s going on. Your pipeline shouldn’t blow memory. We assume ~4GiB per core per executor.
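As a rough sanity check (my arithmetic, not a Hail guarantee): 20g of executor memory shared across 4 cores gives about 5 GiB per core, which is above that ~4 GiB assumption.

```python
executor_memory_gib = 20  # from spark.executor.memory = '20g'
cores = 4                 # e.g. master='local[4]'
per_core_gib = executor_memory_gib / cores
print(per_core_gib)       # 5.0
```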

Can you try starting it with this environment variable instead:

export PYSPARK_SUBMIT_ARGS='--executor-memory 20G --driver-memory 20G pyspark-shell'

Thank you for your comments. I believe N was 4 in my scenario as well. I am launching a Jupyter Notebook session to use Hail; do you know how setting the environment variable works there? I am not sure if adding it to .bashrc will work.


You can set it in Python, as long as you do so before you import Hail:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = ...
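Filled in with the value suggested earlier in this thread, that would look like (set it before importing hail):

```python
import os

# Value taken from the export line suggested earlier in this thread.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--executor-memory 20G --driver-memory 20G pyspark-shell'
)
```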