We need to call the split_multi_hts function with permit_shuffle=True, since in our case its precondition ("This method assumes ds contains one non-split variant per locus") is not satisfied. Are there any alternatives, either preprocessing the VCF itself before we load it into a MatrixTable at the very beginning, or a step in Hail before running split_multi_hts, that would let us avoid permit_shuffle=True? The shuffle has such large memory requirements that we are looking for any other way out. One idea, if we do have to keep permit_shuffle=True, is to set spark.default.parallelism > 2000, but we are currently unsure how much that would help, or whether it would help at all (see "What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism?" on Stack Overflow).
Where do these VCFs come from, and what is their structure? Maybe it is possible to split the indels and the SNPs separately and then union the two back together, for instance.
The VCF consists mostly of SNPs and small indels that behave much like SNPs. We are thinking we could simply normalize the VCF and remove variants with undefined alleles (marked as *); as I understand it, that is what split_multi_hts does.