Unable to export genotypes

Hello all, I am processing a relatively large VCF file with ~6000 WES samples. I am trying to export the genotypes of rare loss-of-function “PASS” variants with this line (after filtering the mt):

mt_lof.GT.export("./GT.tsv")

However, I seem to be stuck on Stage 7 for a long time. Looking at the log file, this line was repeated for a few hours:

2022-10-08 12:17:10 ExternalSorter: INFO: Task 579 force spilling in-memory map to disk and it will release 189.6 MiB memory
2022-10-08 12:20:20 ShuffleBlockFetcherIterator: INFO: Getting 1 (8.6 MiB) non-empty blocks including 1 (8.6 MiB) local and 0 (0.0 B) host-local and 0 (0.0 B) remote blocks
2022-10-08 12:20:20 ShuffleBlockFetcherIterator: INFO: Started 0 remote fetches in 11 ms
2022-10-08 12:20:20 ExternalSorter: INFO: Task 579 force spilling in-memory map to disk and it will release 189.6 MiB memory
2022-10-08 12:21:39 ShuffleBlockFetcherIterator: INFO: Getting 1 (8.6 MiB) non-empty blocks including 1 (8.6 MiB) local and 0 (0.0 B) host-local and 0 (0.0 B) remote blocks
2022-10-08 12:21:39 ShuffleBlockFetcherIterator: INFO: Started 0 remote fetches in 0 ms

And it ended off with:

2022-10-08 13:17:23 root: INFO: RegionPool: FREE: 11.2M allocated (9.9M blocks / 1.2M chunks), regions.size = 2, 0 current java objects, thread 16: Thread-5
2022-10-08 13:17:23 DiskBlockManager: INFO: Shutdown hook called
2022-10-08 13:17:23 root: ERROR: OutOfMemoryError: GC overhead limit exceeded

I have already allocated 29 GB of RAM via spark_conf at initialization. Does anyone have any advice on this issue? Thanks a lot.

Hey @CuriousGeneticist !

Are you using a cluster or a laptop/server? It sounds like you’ve specified something in spark_conf. Can you share exactly how you initialized Hail? If you’re on a laptop/server you should initialize Hail as specified in this post.

Exporting the genotypes as a TSV file, as you’re doing above, converts from a relatively compact VCF-like format (one row per variant) to a relatively verbose format (one row per genotype).
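To get a feel for the scale of that conversion, here is some back-of-the-envelope arithmetic. The sample count comes from the thread; the surviving variant count is purely illustrative:

```python
# Illustrative numbers: ~6000 WES samples (from the thread),
# and suppose 50,000 variants survive the rare-LoF filters.
n_samples = 6_000
n_variants = 50_000

vcf_rows = n_variants               # VCF-like: one row per variant
genotype_cells = n_variants * n_samples  # text export: one cell per (variant, sample)

print(f'{vcf_rows:,} variant rows vs {genotype_cells:,} genotype cells')
```

Either way the text file must materialize every (variant, sample) cell, which is why the export is so much heavier than the row-level metadata.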

Do you want the genotypes or just the variant names? If you just want the variant names do this:

mt_lof.rows().select().export('./GT.tsv')

If you want the genotypes, can you use a VCF export instead? That will be faster and smaller than a TSV export. Take a look at hl.export_vcf.
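For reference, a VCF export of the filtered MatrixTable might look like this (a sketch; the output path is an assumption, and mt_lof is the filtered MatrixTable from the thread):

```python
import hail as hl

# Assumes mt_lof is the filtered MatrixTable from above.
# A .bgz extension block-gzips the output, keeping the file compact.
hl.export_vcf(mt_lof, './rare_lof.vcf.bgz')
```

This keeps the one-row-per-variant layout and compresses well, so it is usually much cheaper to write than a wide genotype TSV.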

Thank you for your response @danking !

I am using a Jupyter notebook on a cluster. I specified this for my initialization:

hl.init(log='./log.log', spark_conf={'spark.driver.memory': '29g', 'spark.executor.memory':'29G'})

If I am not wrong, doing mt.GT.export gives us one row per variant, with a column for each sample's genotype. This is the output file that I desire. I am not too familiar with working with VCF files per se. Could throwing more memory at it possibly allow me to export the genotype file? I have tried 100 GB of RAM, but to no avail… (unless I am not setting the memory properly)

Ah, sorry, you’re correct, mt.GT.export(...) should work as expected.

I don’t think you can specify driver and executor memory in hl.init for clusters. Are you using a Google Dataproc cluster? You should not need to specify anything in spark_conf. Just use the desired machine types when you start your cluster with hailctl dataproc start.

A couple other things you can try:

  • Use GT.tsv.bgz to compress the file, reducing the amount of data Hail has to write to disk.
  • Use parallel=True to write many separate files in parallel rather than one big file.

How many variants do you expect to survive your filters? In general, writing to a text file is a slow and error-prone process. I recommend instead saving the rare loss-of-function variants to a file, then reading that file, filtering, and writing to a text file. Like this:

mt.rows().select().write('gs://.../rare-lof-variants.ht')
rare_lof_variants = hl.read_table('gs://.../rare-lof-variants.ht')
print(f'{rare_lof_variants.count()} rows survived the filters')
mt = hl.read_matrix_table('gs://.../your-original-dataset.mt')
mt = mt.semi_join_rows(rare_lof_variants)
mt.GT.export("gs://.../GT.tsv.bgz", parallel=True)

I also notice you’re writing to a local path, not to gs://. I recommend writing to GCS; it’s more reliable than HDFS.

I tried the gs:// method, but it said that I do not have the gs scheme, and parallel does not seem to be a valid argument for my Hail version (0.2.97-937922d7f46c). Nevertheless, I am trying your approach (albeit without the gs:// and parallel parts) to see if it works! Thanks a lot @danking

Ah, I’ll fix the parallel bit.

How/where did you get this cluster?

It is my institutional high-performance cluster. I tried your method (without the GCS and parallel bits), and I still seem to be stuck at a stage showing (0+1)/1 when it comes to exporting the GT.

Looking at the log, I am seeing these lines repeating:

2022-10-11 11:51:17 ShuffleBlockFetcherIterator: INFO: Getting 1 (8.6 MiB) non-empty blocks including 1 (8.6 MiB) local and 0 (0.0 B) host-local and 0 (0.0 B) remote blocks
2022-10-11 11:51:17 ShuffleBlockFetcherIterator: INFO: Started 0 remote fetches in 0 ms
2022-10-11 11:51:24 ShuffleBlockFetcherIterator: INFO: Getting 1 (8.6 MiB) non-empty blocks including 1 (8.6 MiB) local and 0 (0.0 B) host-local and 0 (0.0 B) remote blocks
2022-10-11 11:51:24 ShuffleBlockFetcherIterator: INFO: Started 0 remote fetches in 0 ms
2022-10-11 11:51:28 ShuffleBlockFetcherIterator: INFO: Getting 1 (8.6 MiB) non-empty blocks including 1 (8.6 MiB) local and 0 (0.0 B) host-local and 0 (0.0 B) remote blocks
2022-10-11 11:51:28 ShuffleBlockFetcherIterator: INFO: Started 0 remote fetches in 0 ms

Do you know if it is something wrong with the original matrix table itself?

Ah! We don’t think of HPC clusters as “clusters” because jobs are usually single processes. Are you submitting a single job to the HPC cluster which invokes Python/Hail?

In that case, you must set the environment variables described here: How do I increase the memory or RAM available to the JVM when I start Hail through Python? - #2 by danking . For example, if you want to use 8GiB of RAM:

export PYSPARK_SUBMIT_ARGS="--driver-memory 8g --executor-memory 8g pyspark-shell"

If you don’t set these environment variables, Spark defaults to a very tiny amount of memory (regardless of what you put in hl.init).
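If you launch from Python rather than a login shell, the same setting can be applied programmatically (a sketch; the key point is that the variable must be set before hl.init() launches the JVM, since Spark only reads it at JVM startup):

```python
import os

# Assumption: no SparkContext exists yet in this process.
# PYSPARK_SUBMIT_ARGS is read only once, when the JVM is first launched,
# so this must run before hl.init().
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--driver-memory 8g --executor-memory 8g pyspark-shell'
)

# import hail as hl   # import/init only after the variable is set
# hl.init(log='./log.log')
```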

Oh okay, that makes sense! I usually launch a Jupyter notebook session using the OnDemand service, but I will add that to my .bashrc and submit a simple Python script to see if the commands will work.

I submitted it as a Python script job with the environment variable set, but it is still the same issue of being stuck at a certain stage. I also realized that I cannot even write the rare-LoF MatrixTable. Do you think something may be wrong upstream? (A VCF issue?)

Hmm. The file shouldn’t affect the runtime performance. Can you share the script you’re executing or everything that leads up to the stall? It’s hard to diagnose this issue without seeing the exact command you’re executing.

Thanks for helping me with this! These are the lines that I am executing:

import hail as hl
hl.init(log='./log.log', spark_conf={'spark.driver.memory': '29g', 'spark.executor.memory':'29G'})

#import stuff

mt = hl.read_matrix_table("vep.mt")   # MatrixTable after VEP annotation (~6000 WES samples)

mt_lof = mt.filter_rows(hl.set(['frameshift_variant', 'stop_gained', "splice_acceptor_variant", "splice_donor_variant"]).contains(mt.most_severe_consequence))
mt_lof = mt_lof.filter_rows(hl.len(mt_lof.filters) == 0)
mt_lof = mt_lof.filter_rows(mt_lof.gnomad_af > 0.001, keep=False)
mt_lof = mt_lof.filter_rows(hl.set(rf_line).contains(mt_lof.gene_symbol))   # Filter to a gene set of ~300 genes; rf_line is a Python list

samples = mt_lof.aggregate_cols(hl.agg.take(mt_lof.s, 3000))  # Take the first 3000 samples
mt_lof = mt_lof.filter_cols(hl.literal(samples).contains(mt_lof.s))
mt_lof = mt_lof.filter_rows(hl.agg.any(mt_lof.GT.is_non_ref()))

mt_lof.GT.export("GT.tsv")

Please let me know if any line may be causing problems. I ran very similar code on another MatrixTable of ~10k WES samples, and it worked fine (even without setting the environment variable for memory), so I am not too sure why it will not work here.
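One way to narrow down where the stall happens (a debugging sketch, not something from the thread) is to force the filtered MatrixTable to evaluate before the export:

```python
# Hypothetical debugging step: materialize the filtered dimensions to see
# whether the stall is in the filtering pipeline or in the export itself.
n_rows, n_cols = mt_lof.count()
print(f'{n_rows} variants x {n_cols} samples survive the filters')
```

If this count also hangs, the problem is upstream of the export.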

Hmm. None of this looks crazy to me. Can you share the Hail log file? I’ll ask the team to take a look.