Loading a large and growing cohort into Hail

Hi everyone,

I was wondering if there are any best practices for loading a large and growing cohort of samples into Hail. I stumbled upon a previous topic from Nov '16 with some good information.

Unfortunately, from what I can tell, adding samples incrementally is still challenging and would require continually merging VCF files as I receive new samples. Is this the case? While possible, this doesn’t seem practical since I’m working with a cohort of 100k+ samples.
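
For concreteness, the kind of repeated merge I’m picturing looks roughly like this (paths are made up, and I’m assuming `union_cols` is the right way to bolt a new batch onto an existing MatrixTable):

```python
import hail as hl

hl.init()

# Existing cohort, previously imported and written out as a MatrixTable.
cohort = hl.read_matrix_table('gs://my-bucket/cohort.mt')

# A new batch of samples arrives as a joint-called VCF.
new_batch = hl.import_vcf('gs://my-bucket/new_batch.vcf.bgz',
                          reference_genome='GRCh38')

# Add the new samples as columns. As far as I can tell, union_cols does an
# inner join on the row key by default, so variants private to one side
# would be dropped -- which is part of why this feels wrong.
merged = cohort.union_cols(new_batch)

# Rewrite the whole cohort -- this step touches all 100k+ samples every
# time a new batch arrives.
merged.write('gs://my-bucket/cohort_v2.mt', overwrite=True)
```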

Thank you for your time!


The standard methods for combining gVCFs into project VCFs have broken down at the Broad as well, and so we are working on functionality inside Hail that will be able to assemble cohorts of >100,000 WGS samples.

This machinery is still very much experimental and in development, but I expect that in ~3 months it may be stable enough for a couple of advanced outside users.

The infrastructure contains three important components:

  1. A representation for the data generally contained in a project VCF, but generated from gVCFs and represented in sparse form such that it scales with N_samples, rather than with N_samples^1.5 like project VCFs do (see the first sketch after this list).
  2. A hierarchical merging algorithm for the representation described in (1), which can merge hundreds of thousands or millions of WGS samples for a few cents per sample. As a side effect, this algorithm also makes incremental addition quite cheap (see the second sketch after this list).
  3. Functions that make it possible to replicate the standard Hail genetics functions like split_multi_hts or variant_qc on the representation described in (1). This includes a densify operation that realizes the project VCF in memory for computations that need dense data, but does so more cheaply than reading all of that data from disk would (see the third sketch after this list).
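
To make (1) a bit more concrete: the sparse data is still an ordinary MatrixTable on disk, so the usual tooling applies. The schema is not final; the `END` reference-block field and local-allele fields like `LGT`/`LA` are what we expect the representation to carry, shown here against a made-up path.

```python
import hail as hl

# Made-up path to a combined sparse dataset.
sparse_mt = hl.read_matrix_table('gs://my-bucket/cohort.sparse.mt')

# Entry fields are expected to include reference blocks (END) and
# local-allele genotype fields (LGT, LA, ...) instead of dense,
# project-VCF-style GT/AD for every sample at every variant.
sparse_mt.describe()

# Rough measure of sparsity, assuming reference-block entries carry END:
frac_ref_blocks = sparse_mt.aggregate_entries(
    hl.agg.fraction(hl.is_defined(sparse_mt.END)))
print(frac_ref_blocks)
```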
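
For (2), the combiner entry point is still experimental and its interface may change, but here is a sketch of the intended workflow, assuming the experimental `hl.experimental.run_combiner` function and made-up paths:

```python
import hail as hl

hl.init()

# Per-sample gVCFs for a new batch (made-up paths).
gvcf_paths = [
    'gs://my-bucket/gvcfs/sample_0001.g.vcf.bgz',
    'gs://my-bucket/gvcfs/sample_0002.g.vcf.bgz',
]

# Hierarchically combine the gVCFs into the sparse representation: inputs
# are merged in small groups, and the intermediate results are merged
# again, so total work grows roughly linearly in the number of samples.
# Adding a batch later means combining only the new gVCFs and merging the
# result with the existing sparse dataset, not reprocessing the cohort.
hl.experimental.run_combiner(
    gvcf_paths,
    out_file='gs://my-bucket/new_batch.sparse.mt',
    tmp_path='gs://my-bucket/tmp/',
    reference_genome='GRCh38',
    use_genome_default_intervals=True,
)
```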
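
And for (3), a sketch of what running standard genetics functions on the sparse representation might look like, assuming the experimental helpers `hl.experimental.sparse_split_multi` and `hl.experimental.densify` (names may change):

```python
import hail as hl

# Made-up path to a combined sparse dataset.
sparse_mt = hl.read_matrix_table('gs://my-bucket/cohort.sparse.mt')

# Sparse-aware analogue of split_multi_hts: splits multiallelic sites
# directly on the sparse data.
split = hl.experimental.sparse_split_multi(sparse_mt)

# densify realizes the dense, project-VCF-like entries on the fly for
# computations that need a call for every sample at every variant; the
# dense form is never written back to disk.
dense = hl.experimental.densify(split)

# Standard Hail functions then work as usual on the densified data.
dense = hl.variant_qc(dense)
dense.rows().select('variant_qc').show()
```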