Hail MT Directory Size

hpatel96 · December 6, 2022, 8:42pm

Recently, I have been using Hail to perform some analyses on the 1000 Genomes phase 3 dataset (found at https://console.cloud.google.com/storage/browser/genomics-public-data/1000-genomes-phase-3/vcf;tab=objects?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&prefix=&forceOnObjectsSortingFiltering=false). I downloaded the entire dataset by chromosome and used bcftools to concatenate the per-chromosome files into one large VCF. That file is roughly 800 GB. After reading this large VCF into Hail and writing it out as a single MatrixTable, I found that the resulting directory was shockingly small (less than 50 GB). How is this possible?
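For anyone who wants to reproduce the size comparison, here is a minimal sketch of how the two on-disk sizes could be measured, assuming both the VCF and the MatrixTable directory live on a local filesystem. The helper names `dir_size_bytes` and `human_readable` and the paths in the comments are hypothetical, made up for illustration; this is not part of Hail.

```python
import os


def dir_size_bytes(path):
    """Return the total size in bytes of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Skip symlinks so that linked files are not counted.
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def human_readable(n_bytes):
    """Format a byte count using binary (1024-based) units."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n_bytes < 1024 or unit == "TB":
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1024


# Example usage (hypothetical paths, adjust to your own layout):
#   vcf_size = os.path.getsize("all_chromosomes.vcf")
#   mt_size = dir_size_bytes("all_chromosomes.mt")
#   print(human_readable(vcf_size), "->", human_readable(mt_size))
```

A MatrixTable is written as a directory containing many files, so the sizes of all files under it have to be summed, whereas the VCF is a single file.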

Is there any documentation that describes how Hail stores the sample and variant information so that the file size is reduced this drastically? What compression is being used here?

Thanks!
