The standard methods for combining gVCFs into project VCFs have broken down at the Broad as well, so we are building functionality inside Hail to assemble cohorts of >100,000 WGS samples.
This machinery is still very much experimental and in development, but I expect that in ~3 months it may be stable enough for a couple of advanced outside users.
The infrastructure contains three important components:
1. A representation of the data normally contained in a project VCF, but generated directly from gVCFs and stored in sparse form, so that its size scales with N_samples rather than with N_samples^1.5 as project VCFs do.
2. A hierarchical merging algorithm for the representation described in (1), which can merge hundreds of thousands, or even millions, of WGS samples for roughly a few cents per sample. As a side effect, this algorithm also makes incremental addition of new samples quite cheap!
3. Functions that replicate the standard Hail genetics functionality, like `variant_qc`, on the representation described in (1). These include a `densify` operation that realizes the project VCF in memory for computations that need the dense data, but does so more cheaply than reading all of that data from disk would.
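To make the sparse representation in (1) concrete, here is a toy sketch in plain Python. It is not Hail's actual data model or API; the `RefBlock`/`Variant` types are hypothetical illustrations of the key idea, which comes straight from gVCFs: between a sample's variant calls, long runs of homozygous-reference sites are stored as single reference blocks (cf. the gVCF `END` field), so each sample's column scales with that sample's own record count rather than with the union of all variant sites in the cohort.

```python
# Toy sketch of a sparse per-sample column: a few variant records plus
# reference blocks covering everything in between. NOT Hail's API.
from dataclasses import dataclass

@dataclass
class RefBlock:
    start: int  # first position covered (inclusive)
    end: int    # last position covered (inclusive), like the gVCF END field
    gq: int     # minimum genotype quality across the block

@dataclass
class Variant:
    pos: int
    gt: str     # called genotype at this site

# One sample's column: storage is proportional to the number of records
# in that sample's gVCF, not to N_sites x N_samples.
sample_col = [
    RefBlock(1, 999, gq=40),
    Variant(1000, gt="0/1"),
    RefBlock(1001, 5000, gq=35),
]

def lookup(col, pos):
    """Return what a sparse column says about a position, or None."""
    for rec in col:
        if isinstance(rec, Variant) and rec.pos == pos:
            return rec
        if isinstance(rec, RefBlock) and rec.start <= pos <= rec.end:
            return rec
    return None
```

A project VCF would instead materialize an entry for every sample at every variant site in the cohort, which is what drives the superlinear N_samples^1.5 growth.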
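The hierarchical merge in (2) can be sketched as a tree of fixed-size batch merges: combine gVCFs in groups, then merge the intermediate results, so each sample's data passes through only O(log N) merge rounds. The `merge` stand-in below just concatenates sample lists; a real combiner would also union the variant rows. The function names and branch factor are illustrative assumptions, not Hail's implementation.

```python
# Toy sketch of hierarchical (tree) merging. NOT Hail's combiner.
def merge(group):
    # Stand-in for a real merge: a real combiner would union variant
    # rows and concatenate sparse sample columns. Here we just flatten.
    merged = []
    for g in group:
        merged.extend(g)
    return merged

def hierarchical_merge(samples, branch_factor=100):
    level = [[s] for s in samples]  # start with one "gVCF" per sample
    rounds = 0
    while len(level) > 1:
        level = [merge(level[i:i + branch_factor])
                 for i in range(0, len(level), branch_factor)]
        rounds += 1
    return level[0], rounds

# 100,000 inputs with branch factor 100 take ceil(log_100(100000)) = 3 rounds.
cohort, rounds = hierarchical_merge(range(100_000), branch_factor=100)
```

Incremental addition falls out naturally: an already-merged cohort is just another intermediate node in the tree, so adding a new batch means one more merge of the existing result with the new batch's subtree, rather than recombining everything from scratch.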
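The `densify` idea in (3) can also be illustrated with a toy sketch: at each variant site, carry each sample's most recent spanning reference block forward to fill in the entries a project VCF would have stored explicitly. This realizes the dense matrix in memory during a scan instead of storing it on disk. The record layout here (`(start, end, payload)` tuples, with `end == start` for variants) is a simplifying assumption, not Hail's representation.

```python
# Toy sketch of a densify-style fill-forward scan. NOT Hail's densify.
def densify(sites, columns):
    """sites: sorted variant positions.
    columns: one sparse column per sample, each a sorted list of
    (start, end, payload) records; a variant record has end == start."""
    dense = []
    cursors = [0] * len(columns)
    for site in sites:
        row = []
        for i, col in enumerate(columns):
            # Advance past records that end before this site; because
            # both sites and records are sorted, each cursor only moves
            # forward, so the whole scan is a single pass per column.
            while cursors[i] < len(col) and col[cursors[i]][1] < site:
                cursors[i] += 1
            rec = col[cursors[i]] if cursors[i] < len(col) else None
            if rec is not None and rec[0] <= site <= rec[1]:
                row.append(rec[2])
            else:
                row.append(None)  # no data at this site for this sample
        dense.append(row)
    return dense

cols = [
    [(1, 999, "ref"), (1000, 1000, "0/1"), (1001, 5000, "ref")],
    [(1, 5000, "ref")],  # one long reference block spans every site
]
dense = densify([1000, 2000], cols)
```

Because the dense rows are produced on the fly and consumed by the downstream computation (e.g. a `variant_qc`-style aggregation), they never need to hit disk, which is why this can beat reading an equivalently dense project VCF.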