I am wondering if someone can point me to the documentation regarding the schema used by Hail when it writes out a MatrixTable to disk. I am performing a GWAS on the Phase 3 1000 Genomes dataset (total combined VCF size of roughly 800 GB), and upon writing out the MatrixTable using .write() I get a .mt directory of roughly 50 GB.
How is such a drastic reduction in size possible? Is some information not being stored? Is it safe to delete the original VCF files without loss of information?
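For reference, my workflow is roughly the following sketch (the paths are placeholders rather than my exact commands):

import hail as hl

hl.init()

# import the per-chromosome VCFs and write a single native-format MatrixTable
# (placeholder paths)
mt = hl.import_vcf('data/1kg_phase3.chr*.vcf.bgz', reference_genome='GRCh37')
mt.write('data/1kg_phase3.mt', overwrite=True)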
Were the VCFs you imported compressed or uncompressed? Hail native formats use compression that’s roughly equivalent to gzip’s compression ratio.
The VCFs were .gz compressed, which I converted to block gzip (.bgz) using bcftools prior to loading them into Hail. Is it a good rule of thumb to compress the VCFs prior to creating the MatrixTable? Can you point me to the documentation that discusses Hail’s native format?
It’s good to keep text files bgzipped, yes.
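If your files are block-compressed but still named .vcf.gz rather than .vcf.bgz, you can tell the importer to treat them as bgzipped. A minimal sketch, with a placeholder path:

import hail as hl

# force_bgz tells Hail that a .gz file is actually block-gzipped,
# so it can be split and read in parallel (placeholder path)
mt = hl.import_vcf('data/calls.vcf.gz', force_bgz=True)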
If the VCFs were compressed before import, I would expect the Hail MatrixTable to be roughly the same size on disk (anywhere from ~30% smaller to slightly bigger). You can run some sanity checks on the MatrixTable by running:
mt = hl.read_matrix_table(...)
print(mt.count())
mt.summarize()
Awesome! Thanks for the quick replies.
One last question, slightly unrelated, but I notice that when I am running a command such as filter_rows, Hail shows a progress bar with information on “Stages”, e.g.
“[Stage 2:================> (146 + 4) / 480]”
I’m guessing 480 is the number of partitions. What is the 4 (the number of Spark workers?)? And does the stage number have any relevance?
The stage number is a counter of Spark scatter-gathers within the session – it’s hard to map it onto specific pieces of a Hail query.
In the progress bar readout (146 + 4) / 480, 480 is the total number of tasks (independently scheduled pieces of work) to do, 146 is the number complete, and 4 is the number in progress. This means you have 4 CPUs working on your query – not much!
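If you want to check the partition count directly rather than inferring it from the progress bar, something like this works (placeholder path; for most Hail stages the task count matches the partition count):

import hail as hl

mt = hl.read_matrix_table('data/1kg_phase3.mt')
print(mt.n_partitions())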
No wonder it was going so slowly! I’m running Hail with Spark locally on an HPC using:
export PYSPARK_SUBMIT_ARGS="--master local[4,2] pyspark-shell"
prior to hail.init()
I believe I have way more available cores than 4, probably somewhere between 32 and 64, on the cluster I am working on. Would changing:
export PYSPARK_SUBMIT_ARGS="--master local[32,2] pyspark-shell"
be the way to go to allocate more resources to Hail?
I’m unfamiliar with the 2-integer syntax in local[4,2] – I’ve always just used local[N], where N is the number of cores, so here local[32] or local[64] will make this much faster!
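If you prefer to keep the configuration in Python, you can also set the environment variable right before initializing Hail. A sketch; adjust the core count to whatever your HPC allocation actually gives you:

import os

# must be set before hl.init() so the Spark JVM is launched with these arguments
os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local[32] pyspark-shell'

import hail as hl
hl.init()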
Awesome thanks! This has been very helpful.
The two-integer syntax just specifies the number of worker threads followed by the maximum number of times a task can fail before Spark gives up (i.e., local[N, maxFailures]).
Ah! Learned something new about Spark today, thanks!