We would like to load a ~1 TB bgzipped VCF and write it in Hail format, but we're running into issues (it fails with what seem to be memory errors). We tried on a subset (5k+ lines) and it works great.
We’re running on Amazon EMR, and as far as I can tell, only one instance is being used at any time.
Try 1 (1 master + 3 core instances, 32 GB mem per instance; failed at Stage 1 after ~11 hrs)
The error message from the pyspark-submit output is essentially: Fatal Python error: Cannot recover from stack overflow.
is the read time expected (14 hrs)?
is the read/write expected to be parallel?
how much memory and space should we be allocating for this? any ballpark estimation methods?
is there anything else we should be looking out for in terms of cluster specs, method parameters, etc. to make it faster and have it complete? change min_partitions?
can/should we do a per-chromosome load? what would be the best way to go about this? import_vcf with a list of VCFs (roughly sketched below)? or import_vcf and write each chromosome separately, then do a merge (somehow)?
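For concreteness, the list-of-VCFs variant we had in mind looks roughly like this (bucket and file names are made up; as far as we can tell, import_vcf accepts a list of paths and imports them as one dataset):

```python
import hail as hl

hl.init()

# Hypothetical per-chromosome shards; hl.import_vcf accepts a list of
# paths and imports them as a single MatrixTable (the files must share
# the same header / sample list).
chroms = [str(c) for c in range(1, 23)] + ['X', 'Y']
vcfs = [f's3://our-bucket/data.chr{c}.vcf.bgz' for c in chroms]

mt = hl.import_vcf(vcfs)
mt.write('s3://our-bucket/data.mt', overwrite=True)
```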
Thank you!! I'm working with OP on this. Not long after posting, we realised by combing through the docs that we should be using force_bgz instead, and it works great now. I must not have read the HailException carefully enough (I probably just read the beginning, where it said "treat as bgzipped", and the end of the paragraph, where it said force=True, my bad).
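For anyone who hits the same thing, the fix amounted to roughly the following; the paths are placeholders and the min_partitions value is just illustrative. force_bgz=True tells Hail to treat a .gz file as block-gzipped so it can still be read in parallel, whereas force=True falls back to a serial gzip read:

```python
import hail as hl

hl.init()

# The file is named .vcf.gz but is actually block-gzipped. Without
# force_bgz=True, import_vcf raises a HailException for .gz input;
# force_bgz=True keeps the parallel, splittable bgzip reader, while
# force=True would instead decompress the whole file on a single core.
mt = hl.import_vcf('s3://our-bucket/data.vcf.gz',
                   force_bgz=True,
                   min_partitions=1000)  # optional; bumps read parallelism
mt.write('s3://our-bucket/data.mt', overwrite=True)
```

Renaming the file to end in .vcf.bgz should also avoid needing the flag at all.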
Write still takes 4x as long, but I guess that is expected.
4x as long as what? Sorry, don’t know what you mean.
I’d also say that you’re using a pretty small cluster for a 1 TB VCF - I think the people around here would probably use ~500 cores at least for that task.
Hail scales pretty well, so even though you pay more per hour, the task finishes faster and the total cost doesn't go up much.
This second Job/Stage (the write) is the one I'm saying is longer than the first.
Totally agree the cluster is small - we were just troubleshooting this part, so we wanted to see how it went. If increasing cores will help, we will absolutely do that.
Ah. The first stage parses the VCF to scan the variants and check their ordering; it doesn't need to parse the genotypes. The second stage parses all the data, then writes and compresses it, which is why it takes longer.