Loading a large and growing cohort into Hail

Hi everyone,

I was wondering if there are any best practices for loading a large and growing cohort of samples into Hail. I stumbled upon a previous topic from Nov '16 with some good information.

Unfortunately, from what I can tell, adding samples incrementally is still challenging and would require continually merging VCF files as I receive new samples. Is this the case? While possible, this doesn’t seem practical since I’m working with a cohort of 100k+ samples.
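
For concreteness, the kind of repeated merge I’m picturing looks roughly like this (paths are made up, and I’m assuming `union_cols` is the right way to bolt a new batch onto an existing MatrixTable):

```python
import hail as hl

hl.init()

# Existing cohort, previously imported and written out as a MatrixTable.
cohort = hl.read_matrix_table('gs://my-bucket/cohort.mt')

# A new batch of samples arrives as a joint-called VCF.
new_batch = hl.import_vcf('gs://my-bucket/new_batch.vcf.bgz',
                          reference_genome='GRCh38')

# Add the new samples as columns. As far as I can tell, union_cols does an
# inner join on the row key by default, so variants private to one side
# would be dropped -- which is part of why this feels wrong.
merged = cohort.union_cols(new_batch)

# Rewrite the whole cohort -- this step touches all 100k+ samples every
# time a new batch arrives.
merged.write('gs://my-bucket/cohort_v2.mt', overwrite=True)
```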

Thank you for your time!


The standard methods for combining gVCFs into project VCFs have broken down at the Broad as well, and so we are working on functionality inside Hail that will be able to assemble cohorts of >100,000 WGS samples.

This machinery is still very much experimental and in development, but I expect that in ~3 months it may be stable enough for a couple of advanced outside users.

The infrastructure contains three important components:

  1. A representation for the data generally contained in a project VCF, but generated from gVCFs and represented in sparse form such that it scales with N_samples, rather than with N_samples^1.5 like project VCFs do (see the first sketch after this list).
  2. A hierarchical merging algorithm for the representation described in (1), which can merge hundreds of thousands or millions of WGS samples for a few cents per sample. As a side effect, this algorithm also makes incremental addition quite cheap (see the second sketch after this list).
  3. Functions that make it possible to replicate the standard Hail genetics functions like split_multi_hts or variant_qc on the representation described in (1). This includes a densify operation that realizes the project VCF in memory for computations that need dense data, but does so more cheaply than reading all of that data from disk would (see the third sketch after this list).
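
To make (1) a bit more concrete: the sparse data is still an ordinary MatrixTable on disk, so the usual tooling applies. The schema is not final; the `END` reference-block field and local-allele fields like `LGT`/`LA` are what we expect the representation to carry, shown here against a made-up path.

```python
import hail as hl

# Made-up path to a combined sparse dataset.
sparse_mt = hl.read_matrix_table('gs://my-bucket/cohort.sparse.mt')

# Entry fields are expected to include reference blocks (END) and
# local-allele genotype fields (LGT, LA, ...) instead of dense,
# project-VCF-style GT/AD for every sample at every variant.
sparse_mt.describe()

# Rough measure of sparsity, assuming reference-block entries carry END:
frac_ref_blocks = sparse_mt.aggregate_entries(
    hl.agg.fraction(hl.is_defined(sparse_mt.END)))
print(frac_ref_blocks)
```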
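
For (2), the combiner entry point is still experimental and its interface may change, but here is a sketch of the intended workflow, assuming the experimental `hl.experimental.run_combiner` function and made-up paths:

```python
import hail as hl

hl.init()

# Per-sample gVCFs for a new batch (made-up paths).
gvcf_paths = [
    'gs://my-bucket/gvcfs/sample_0001.g.vcf.bgz',
    'gs://my-bucket/gvcfs/sample_0002.g.vcf.bgz',
]

# Hierarchically combine the gVCFs into the sparse representation: inputs
# are merged in small groups, and the intermediate results are merged
# again, so total work grows roughly linearly in the number of samples.
# Adding a batch later means combining only the new gVCFs and merging the
# result with the existing sparse dataset, not reprocessing the cohort.
hl.experimental.run_combiner(
    gvcf_paths,
    out_file='gs://my-bucket/new_batch.sparse.mt',
    tmp_path='gs://my-bucket/tmp/',
    reference_genome='GRCh38',
    use_genome_default_intervals=True,
)
```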
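
And for (3), a sketch of what running standard genetics functions on the sparse representation might look like, assuming the experimental helpers `hl.experimental.sparse_split_multi` and `hl.experimental.densify` (names may change):

```python
import hail as hl

# Made-up path to a combined sparse dataset.
sparse_mt = hl.read_matrix_table('gs://my-bucket/cohort.sparse.mt')

# Sparse-aware analogue of split_multi_hts: splits multiallelic sites
# directly on the sparse data.
split = hl.experimental.sparse_split_multi(sparse_mt)

# densify realizes the dense, project-VCF-like entries on the fly for
# computations that need a call for every sample at every variant; the
# dense form is never written back to disk.
dense = hl.experimental.densify(split)

# Standard Hail functions then work as usual on the densified data.
dense = hl.variant_qc(dense)
dense.rows().select('variant_qc').show()
```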