I am wondering if someone can point me to the documentation regarding the schema used by Hail when it writes out a MatrixTable to disk. I am performing a GWAS on the Phase 3 1000 Genomes dataset (total combined VCF size of roughly 800 GB), and upon writing out the MatrixTable using .write() I get a .mt directory of roughly 50 GB.
How is such a drastic reduction in size possible? Is some information not being stored? Is it safe to delete the original VCF files without loss of information?
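For reference, my workflow is roughly the following sketch (the paths are placeholders rather than my exact commands):

import hail as hl

hl.init()

# import the per-chromosome VCFs and write a single native-format MatrixTable
# (placeholder paths)
mt = hl.import_vcf('data/1kg_phase3.chr*.vcf.bgz', reference_genome='GRCh37')
mt.write('data/1kg_phase3.mt', overwrite=True)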
Were the VCFs you imported compressed or uncompressed? Hail native formats use compression that’s roughly equivalent to gzip’s compression ratio.
The VCFs were .gz compressed, which I converted to block gzip (.bgz) using bcftools prior to loading them into Hail. Is it a good rule of thumb to compress the VCFs prior to creating the MatrixTable? Can you point me to the documentation that discusses Hail’s native format?
It’s good to keep text files bgzipped, yes.
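If your files are block-compressed but still named .vcf.gz rather than .vcf.bgz, you can tell the importer to treat them as bgzipped. A minimal sketch, with a placeholder path:

import hail as hl

# force_bgz tells Hail that a .gz file is actually block-gzipped,
# so it can be split and read in parallel (placeholder path)
mt = hl.import_vcf('data/calls.vcf.gz', force_bgz=True)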
If the VCFs were compressed before import, I would expect the Hail MatrixTable to be roughly the same size on disk (anywhere from ~30% smaller to slightly bigger). You can run some sanity checks on the MatrixTable by running:
mt = hl.read_matrix_table(...)
print(mt.count())
mt.summarize()
Awesome! Thanks for the quick replies.
One last question, slightly unrelated, but I notice that when I am running a command such as filter_rows, Hail shows a progress bar with information on “Stages”, e.g.
“[Stage 2:================> (146 + 4) / 480]”
I’m guessing 480 is the number of partitions. What is the 4 (the number of Spark workers?)? And does the stage number have any relevance?
The stage number is a counter of Spark scatter-gathers within the session – it’s hard to map it onto specific pieces of a Hail query.
In the progress bar readout (146 + 4) / 480, 480 is the total number of tasks (independently scheduled pieces of work) to do, 146 is the number complete, and 4 is the number in progress. This means you have 4 CPUs working on your query – not much!
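If you want to check the partition count directly rather than inferring it from the progress bar, something like this works (placeholder path; for most Hail stages the task count matches the partition count):

import hail as hl

mt = hl.read_matrix_table('data/1kg_phase3.mt')
print(mt.n_partitions())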
No wonder it was going so slowly! I’m running Hail with Spark locally on an HPC using:
export PYSPARK_SUBMIT_ARGS="--master local[4,2] pyspark-shell"
prior to hail.init()
I believe I have way more available cores than 4, probably somewhere between 32 and 64, on the cluster I am working on. Would changing:
export PYSPARK_SUBMIT_ARGS="--master local[32,2] pyspark-shell"
be the way to go to allocate more resources to Hail?
I’m unfamiliar with the 2-integer syntax in local[4,2] – I’ve always just used local[N], where N is the number of cores, so here local[32] or local[64] will make this much faster!
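If you prefer to keep the configuration in Python, you can also set the environment variable right before initializing Hail. A sketch; adjust the core count to whatever your HPC allocation actually gives you:

import os

# must be set before hl.init() so the Spark JVM is launched with these arguments
os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local[32] pyspark-shell'

import hail as hl
hl.init()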
Awesome thanks! This has been very helpful.
The two-integer syntax just specifies the number of worker threads followed by the maximum number of times a task can fail before Spark gives up (i.e., local[N, maxFailures]).
Ah! Learned something new about Spark today, thanks!