We need to call the split_multi_hts function with permit_shuffle=True, since in our case its precondition ("This method assumes ds contains one non-split variant per locus") is not satisfied. Are there any alternatives, either preprocessing the VCF itself before we load it into a MatrixTable at the very beginning, or a step in Hail before running split_multi_hts, that would let us avoid permit_shuffle=True? The shuffle has such large memory requirements that we are looking for any other way out. One idea, if we do have to keep permit_shuffle=True, is to set spark.default.parallelism > 2000, but we are currently unsure how much that would help, or whether it would help at all (see "What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism?" on Stack Overflow).
Where do these VCFs come from, and what is their structure? Maybe it is possible to split the indels and the SNPs separately and then union the two back together, for instance.
The VCF consists mostly of SNPs and small indels that behave much like SNPs. We are thinking we could simply normalize the VCF and remove variants with undefined alleles (marked as *); as I understand it, that is what split_multi_hts does.