Large gVCF into VDS

mhebrard · May 19, 2025, 1:46am

Hi @gil - Thanks for the info. I am trying a slightly different geometry but I note your idea of adding 200 gVCFs to the previous VDS incrementally with the spark_max_stage_parallelism=‘20000’ and a high target_records.

Thanks for the insight on the cluster. I was wondering, do you specify data nodes ? or if not what is the disk space allocated ?

On my side, my last attempt was running on a cluster of 31 CORE nodes (r6g.4xlarge) x 150Gb (~4,500Gb disk) with 30,000 max stage parallelism.

rjanan · June 11, 2025, 10:30am

Hi everyone, im having similar problems with the VDS combiner. I have roughly 4000 gVCFs im trying to merge but because of the HPC constraints im getting ‘too many open files errors’. The most samples I can combine in one run is 200. Was wondering if anyone else has come across the same problem?

Im currently testing 500 samples with spark_max_stage_parallelism=‘20000’, target_records=1_100_000 and 50 branch factor. If anyone has any suggestions, please let me know.

Thank you!

Topic		Replies	Views
VDS combiner unsuccesful on large cohort Hail Query & hailctl	2	198	May 22, 2024
Possible vcf_combiner issue Hail Query & hailctl	19	1244	June 15, 2020
Help for finding rare variants for 100 patients Hail Query & hailctl	10	443	March 20, 2023
MethodTooLargeException when running vds.combiner Hail Query & hailctl	2	101	June 30, 2024
Importing many sample-specific VCFs Hail Query & hailctl	12	1216	December 12, 2022

Large gVCF into VDS

Related topics