Large gVCF into VDS

Hi @gil - Thanks for the info. I am trying a slightly different geometry, but I note your idea of adding 200 gVCFs incrementally to the previous VDS with spark_max_stage_parallelism='20000' and a high target_records.
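
For anyone following along, a minimal sketch of what that incremental step looks like on my side (the path and list variables are placeholders, and using hl._set_flags to set the parallelism flag is my assumption, so adjust for your environment):

    import hail as hl

    hl.init(tmp_dir=TMP_PATH)  # TMP_PATH, NEW_VDS_PATH, etc. are placeholders

    # Assumption: the stage-parallelism flag is set via hl._set_flags; adjust if
    # your deployment configures it differently.
    hl._set_flags(spark_max_stage_parallelism='20000')

    combiner = hl.vds.new_combiner(
        output_path=NEW_VDS_PATH,          # merged result for this increment
        temp_path=TMP_PATH,
        vds_paths=[PREVIOUS_VDS_PATH],     # VDS built from the earlier batches
        gvcf_paths=NEXT_200_GVCFS,         # the next batch of ~200 gVCFs
        use_genome_default_intervals=True,
        reference_genome='GRCh38',
        target_records=1_000_000,          # "high target_records"; value illustrative
    )
    combiner.run()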

Thanks for the insight on the cluster. I was wondering: do you specify data nodes? If not, how much disk space is allocated?

On my side, my last attempt ran on a cluster of 31 core nodes (r6g.4xlarge) with 150 GB of disk each (~4,500 GB total) and a max stage parallelism of 30,000.

Hi everyone, I'm having similar problems with the VDS combiner. I have roughly 4,000 gVCFs I'm trying to merge, but because of HPC constraints I'm getting 'too many open files' errors. The most samples I can combine in one run is 200. Has anyone else come across the same problem?

I'm currently testing 500 samples with spark_max_stage_parallelism='20000', target_records=1_100_000, and a branch factor of 50. If anyone has any suggestions, please let me know.
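
For reference, here is roughly how I am wiring those settings together (a sketch only; the path and list variables are placeholders, and using hl._set_flags to set the parallelism flag is my assumption):

    import hail as hl

    hl.init(tmp_dir=TMP_PATH)  # TMP_PATH, GVCF_PATHS, VDS_PATH are placeholders
    hl._set_flags(spark_max_stage_parallelism='20000')  # assumed flag-setting mechanism

    combiner = hl.vds.new_combiner(
        output_path=VDS_PATH,
        temp_path=TMP_PATH,
        gvcf_paths=GVCF_PATHS[:500],        # the 500-sample test batch
        use_genome_default_intervals=True,
        reference_genome='GRCh38',          # adjust to your reference build
        branch_factor=50,
        target_records=1_100_000,
    )
    combiner.run()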

Thank you!

Hi, I’m running the VDS combiner on a large WGS cohort and hitting a ClassTooLargeException at the final merge step.

Error:

ClassTooLargeException: Class too large: __C6353collect_distributed_array_matrix_multi_writer

Setup:

  • Hail version: 0.2.135-034ef3e08116

  • Backend: Hail Batch

  • ~8,000 gVCFs (WGS, GRCh38)

  • Snippet:

    
    import hail as hl

    # manifest is assumed to be a pandas DataFrame with one row per gVCF,
    # containing 'output_vcf' and 'collaborator_sample_id' columns (loaded elsewhere).
    combiner = hl.vds.new_combiner(
        output_path=VDS_PATH,
        temp_path=TMP_PATH,
        gvcf_paths=manifest['output_vcf'].tolist(),
        gvcf_sample_names=manifest['collaborator_sample_id'].tolist(),
        gvcf_external_header=HEADER_PATH,
        use_genome_default_intervals=True,
        reference_genome='GRCh38',
        branch_factor=16,
        gvcf_batch_size=25,
        target_records=100_000,
    )

    combiner.run()
    
  • All preceding tasks succeeded (64,664 jobs); error occurs only at the final merge/write step

Has anyone encountered this issue, or does anyone have suggestions for resolving it? Thanks in advance!
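
For what it is worth, my understanding is that the combiner persists its plan and intermediate results under temp_path, so a retry should be able to resume at the failing merge step instead of redoing the 64,664 completed jobs. A rough sketch of resuming from the saved plan (the plan path is hypothetical; I have not confirmed the default file name):

    import hail as hl

    # Assuming new_combiner was given an explicit save_path (or the default plan
    # file under temp_path can be located), the run can be resumed in a new
    # session without redoing the completed jobs.
    PLAN_PATH = f'{TMP_PATH}/combiner_plan.json'  # hypothetical path

    combiner = hl.vds.load_combiner(PLAN_PATH)
    combiner.run()  # continues from the saved combiner plan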