We would like to load a ~1 TB bgzipped VCF and write it out in Hail's native format, but we're running into issues (it fails with what appear to be memory errors). We tried a subset (5,000+ lines) and it works fine.
We’re running on Amazon EMR, and as far as I can tell, only one instance is being used at any time.
Try 1 (1 master + 3 core instances, 32 GB memory per instance; failed at Stage 1 after ~11 hrs)
import hail as hl
Try 2 (1 master + 1 core instance, 128 GB memory per instance, 2 TB EBS; Stage 1 completed after 14 hrs, then failed at Stage 2 after 6 hrs and ~45 GB of input)
import hail as hl
hl.import_vcf("s3://path/file.vcf.gz", force=True, reference_genome='GRCh38').write("s3://path/hailout/file.mt", overwrite=True, stage_locally=True)
Error message from pyspark-submit output is essentially: Fatal Python error: Cannot recover from stack overflow.
Is a read time of 14 hrs expected?
Is the read/write expected to run in parallel across the instances?
How much memory and disk space should we be allocating for this? Are there any ballpark estimation methods?
Is there anything else we should be looking out for in terms of cluster specs, method parameters, etc. to make this faster and get it to complete? Should we change min_partitions?
Can/should we do a per-chromosome load, and what would be the best way to go about it? A single import_vcf with a list of VCFs, or import_vcf and write each chromosome separately and then merge them (somehow)? A rough sketch of what we mean is below.
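For the last two questions, this is roughly the per-chromosome import we have in mind (the per-chromosome S3 paths and the min_partitions value are placeholders, not our real layout; we're assuming import_vcf accepts a list of VCFs that share the same samples/header):

import hail as hl

# Placeholder per-chromosome paths, not our real file layout.
chrom_vcfs = ["s3://path/by_chrom/file.chr{}.vcf.gz".format(c)
              for c in list(range(1, 23)) + ["X", "Y"]]

mt = hl.import_vcf(
    chrom_vcfs,                    # import_vcf also takes a list of VCFs with identical samples/headers
    force=True,                    # same flag we used above
    reference_genome="GRCh38",
    min_partitions=2000,           # placeholder value; this is part of what we're asking about
)
mt.write("s3://path/hailout/file.mt", overwrite=True, stage_locally=True)

The alternative we were considering is running import_vcf + write per chromosome and then combining the resulting matrix tables afterwards, but we're not sure whether that buys us anything over the single call above.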