Recently, I have been using Hail to perform some analyses on the 1000 Genomes phase 3 dataset (found at https://console.cloud.google.com/storage/browser/genomics-public-data/1000-genomes-phase-3/vcf;tab=objects?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&prefix=&forceOnObjectsSortingFiltering=false). I downloaded the dataset by chromosome and used bcftools to concatenate the per-chromosome files into one large VCF, roughly 800GB in size. After reading this VCF into Hail and writing it out as a single MatrixTable, I found that the resulting directory was shockingly small (less than 50GB). How is this possible?
Is there any documentation describing how Hail stores sample and variant information in a way that achieves such a drastic reduction in file size? What compression is being used here?
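My rough intuition is that genotype data is extremely redundant (most calls in 1000 Genomes are homozygous reference), so a block-compressed binary format should shrink it dramatically. As a toy illustration of that redundancy (this is NOT Hail's actual on-disk format, just a sketch using Python's stdlib `zlib`):

```python
import zlib

# Simulate one VCF-like genotype row for 2504 samples (the 1000 Genomes
# phase 3 sample count), where every call is homozygous reference "0|0".
samples = 2504
row = "\t".join(["0|0"] * samples) + "\n"

# Repeat the row to mimic many near-identical variant records.
raw = row.encode() * 1000

compressed = zlib.compress(raw, 6)
print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")
print(f"ratio: {len(raw) / len(compressed):.0f}x")
```

On data this repetitive, even generic DEFLATE compression achieves a ratio of well over 100x, so I can imagine a purpose-built format doing far better than my naively concatenated text VCF. I would still love a pointer to documentation on what Hail actually does.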