Unable to export genotypes

Hello all, I am processing a relatively large VCF file with ~6000 WES samples. I am trying to export the genotypes of rare loss-of-function “PASS” variants with this line (after filtering the mt):

mt_lof.GT.export("./GT.tsv")

However, I seem to be stuck on Stage 7 for a long time. Looking at the log file, this line was repeated for a few hours:

2022-10-08 12:17:10 ExternalSorter: INFO: Task 579 force spilling in-memory map to disk and it will release 189.6 MiB memory
2022-10-08 12:20:20 ShuffleBlockFetcherIterator: INFO: Getting 1 (8.6 MiB) non-empty blocks including 1 (8.6 MiB) local and 0 (0.0 B) host-local and 0 (0.0 B) remote blocks
2022-10-08 12:20:20 ShuffleBlockFetcherIterator: INFO: Started 0 remote fetches in 11 ms
2022-10-08 12:20:20 ExternalSorter: INFO: Task 579 force spilling in-memory map to disk and it will release 189.6 MiB memory
2022-10-08 12:21:39 ShuffleBlockFetcherIterator: INFO: Getting 1 (8.6 MiB) non-empty blocks including 1 (8.6 MiB) local and 0 (0.0 B) host-local and 0 (0.0 B) remote blocks
2022-10-08 12:21:39 ShuffleBlockFetcherIterator: INFO: Started 0 remote fetches in 0 ms

And it ended off with:

2022-10-08 13:17:23 root: INFO: RegionPool: FREE: 11.2M allocated (9.9M blocks / 1.2M chunks), regions.size = 2, 0 current java objects, thread 16: Thread-5
2022-10-08 13:17:23 DiskBlockManager: INFO: Shutdown hook called
2022-10-08 13:17:23 root: ERROR: OutOfMemoryError: GC overhead limit exceeded

I have already allocated 29 GB of RAM via spark_conf at initialization. Does anyone have any advice on this issue? Thanks a lot.

Hey @CuriousGeneticist !

Are you using a cluster or a laptop/server? It sounds like you’ve specified something in spark_conf. Can you share exactly how you initialized Hail? If you’re on a laptop/server you should initialize Hail as specified in this post.

Exporting the genotypes as a TSV file, as you’re doing above, converts from a relatively compact VCF-like format (one row per variant) to a relatively verbose format (one row per genotype).
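To get a feel for the scale of that conversion, here is some back-of-the-envelope arithmetic. The sample count comes from the thread; the surviving variant count is purely illustrative:

```python
# Illustrative numbers: ~6000 WES samples (from the thread),
# and suppose 50,000 variants survive the rare-LoF filters.
n_samples = 6_000
n_variants = 50_000

vcf_rows = n_variants               # VCF-like: one row per variant
genotype_cells = n_variants * n_samples  # text export: one cell per (variant, sample)

print(f'{vcf_rows:,} variant rows vs {genotype_cells:,} genotype cells')
```

Either way the text file must materialize every (variant, sample) cell, which is why the export is so much heavier than the row-level metadata.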

Do you want the genotypes or just the variant names? If you just want the variant names do this:

mt_lof.rows().select().export('./GT.tsv')

If you want the genotypes, can you use a VCF export instead? That will be faster and smaller than a TSV export. Take a look at hl.export_vcf.
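For reference, a VCF export of the filtered MatrixTable might look like this (a sketch; the output path is an assumption, and mt_lof is the filtered MatrixTable from the thread):

```python
import hail as hl

# Assumes mt_lof is the filtered MatrixTable from above.
# A .bgz extension block-gzips the output, keeping the file compact.
hl.export_vcf(mt_lof, './rare_lof.vcf.bgz')
```

This keeps the one-row-per-variant layout and compresses well, so it is usually much cheaper to write than a wide genotype TSV.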

Thank you for your response @danking !

I am using a Jupyter notebook on a cluster. I specified this for my initialization:

hl.init(log='./log.log', spark_conf={'spark.driver.memory': '29g', 'spark.executor.memory':'29G'})

If I am not wrong, doing mt.GT.export gives us one row per variant, with a column for each sample's genotype. This is the output file that I desire. I am not too familiar with working with VCF files per se. Could throwing more memory at it possibly allow me to export the genotype file? I have tried 100 GB of RAM, but to no avail… (unless I am not setting the memory properly)

Ah, sorry, you’re correct, mt.GT.export(...) should work as expected.

I don’t think you can specify driver and executor memory in hl.init for clusters. Are you using a Google Dataproc cluster? You should not need to specify anything in spark_conf. Just use the desired machine types when you start your cluster with hailctl dataproc start.

A couple other things you can try:

  • Use GT.tsv.bgz to compress the file, reducing the amount of data Hail has to write to disk.
  • Use parallel=True to write many separate files in parallel rather than one big file.

How many variants do you expect to survive your filters? In general, writing to a text file is a slow and error-prone process. I recommend instead saving the rare loss-of-function variants to a file, then reading that file, filtering, and writing to a text file. Like this:

mt.rows().select().write('gs://.../rare-lof-variants.ht')
rare_lof_variants = hl.read_table('gs://.../rare-lof-variants.ht')
print(f'{rare_lof_variants.count()} rows survived the filters')
mt = hl.read_matrix_table('gs://.../your-original-dataset.mt')
mt = mt.semi_join_rows(rare_lof_variants)
mt.GT.export("gs://.../GT.tsv.bgz", parallel=True)

I also notice you’re writing to a local path, not to gs://. I recommend writing to GCS; it’s more reliable than HDFS.

I tried the gs:// method, but it said that I do not have the gs scheme, and parallel does not seem to be a valid argument for my Hail version (0.2.97-937922d7f46c). Nevertheless, I am trying your approach (albeit without the gs:// and parallel parts) to see if it works! Thanks a lot @danking

Ah, I’ll fix the parallel bit.

How/where did you get this cluster?

It is my institutional high-performance cluster. I tried your method (without the GCS and parallel bits), and I still seem to be stuck at a stage showing (0+1)/1 when it comes to exporting the GT.

Looking at the log, I am seeing these lines repeating:

2022-10-11 11:51:17 ShuffleBlockFetcherIterator: INFO: Getting 1 (8.6 MiB) non-empty blocks including 1 (8.6 MiB) local and 0 (0.0 B) host-local and 0 (0.0 B) remote blocks
2022-10-11 11:51:17 ShuffleBlockFetcherIterator: INFO: Started 0 remote fetches in 0 ms
2022-10-11 11:51:24 ShuffleBlockFetcherIterator: INFO: Getting 1 (8.6 MiB) non-empty blocks including 1 (8.6 MiB) local and 0 (0.0 B) host-local and 0 (0.0 B) remote blocks
2022-10-11 11:51:24 ShuffleBlockFetcherIterator: INFO: Started 0 remote fetches in 0 ms
2022-10-11 11:51:28 ShuffleBlockFetcherIterator: INFO: Getting 1 (8.6 MiB) non-empty blocks including 1 (8.6 MiB) local and 0 (0.0 B) host-local and 0 (0.0 B) remote blocks
2022-10-11 11:51:28 ShuffleBlockFetcherIterator: INFO: Started 0 remote fetches in 0 ms

Do you know if it is something wrong with the original matrix table itself?

Ah! We don’t think of HPC clusters as “clusters” because jobs are usually single processes. Are you submitting a single job to the HPC cluster which invokes Python/Hail?

In that case, you must set the environment variables described here: How do I increase the memory or RAM available to the JVM when I start Hail through Python? - #2 by danking . For example, if you want to use 8GiB of RAM:

export PYSPARK_SUBMIT_ARGS="--driver-memory 8g --executor-memory 8g pyspark-shell"

If you don’t set these environment variables, Spark defaults to a very tiny amount of memory (regardless of what you put in hl.init).
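If you launch from Python rather than a login shell, the same setting can be applied programmatically (a sketch; the key point is that the variable must be set before hl.init() launches the JVM, since Spark only reads it at JVM startup):

```python
import os

# Assumption: no SparkContext exists yet in this process.
# PYSPARK_SUBMIT_ARGS is read only once, when the JVM is first launched,
# so this must run before hl.init().
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--driver-memory 8g --executor-memory 8g pyspark-shell'
)

# import hail as hl   # import/init only after the variable is set
# hl.init(log='./log.log')
```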

Oh okay, that makes sense! I usually launch a Jupyter notebook session using the OnDemand service, but I will add that to my .bashrc and submit a simple Python script to see if the commands will work.

I submitted it as a Python script job with the environment variable set, but it is still the same issue of being stuck at a certain stage. I also realized that I cannot even write the rare-LoF MatrixTable. Do you think something may be wrong upstream? (A VCF issue?)

Hmm. The file shouldn’t affect the runtime performance. Can you share the script you’re executing or everything that leads up to the stall? It’s hard to diagnose this issue without seeing the exact command you’re executing.

Thanks for helping me with this! These are the lines that I am executing:

import hail as hl
hl.init(log='./log.log', spark_conf={'spark.driver.memory': '29g', 'spark.executor.memory':'29G'})

#import stuff

mt = hl.read_matrix_table("vep.mt")   # MatrixTable after VEP annotation (~6000 WES samples)

mt_lof = mt.filter_rows(hl.set(['frameshift_variant', 'stop_gained', "splice_acceptor_variant", "splice_donor_variant"]).contains(mt.most_severe_consequence))
mt_lof = mt_lof.filter_rows(hl.len(mt_lof.filters) == 0)
mt_lof = mt_lof.filter_rows(mt_lof.gnomad_af > 0.001, keep=False)
mt_lof = mt_lof.filter_rows(hl.set(rf_line).contains(mt_lof.gene_symbol))   # Filter to a gene set of ~300 genes; rf_line is a Python list

samples = mt_lof.aggregate_cols(hl.agg.take(mt_lof.s, 3000))  # Take the first 3000 samples
mt_lof = mt_lof.filter_cols(hl.literal(samples).contains(mt_lof.s))
mt_lof = mt_lof.filter_rows(hl.agg.any(mt_lof.GT.is_non_ref()))

mt_lof.GT.export("GT.tsv")

Please let me know if any line may be causing problems. I ran very similar code on another MatrixTable of ~10k WES samples, and it worked fine (even without setting the environment variable for memory), so I am not too sure why it will not work here.
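One way to narrow down where the stall happens (a debugging sketch, not something from the thread) is to force the filtered MatrixTable to evaluate before the export:

```python
# Hypothetical debugging step: materialize the filtered dimensions to see
# whether the stall is in the filtering pipeline or in the export itself.
n_rows, n_cols = mt_lof.count()
print(f'{n_rows} variants x {n_cols} samples survive the filters')
```

If this count also hangs, the problem is upstream of the export.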

Hmm. None of this looks crazy to me. Can you share the Hail log file? I’ll ask the team to take a look.