For gnomAD v3.1, the gnomAD team announced that they used Hail to solve the joint calling problem. Is there a tutorial describing the steps used, in particular the append step?
- Does the resulting VCF differ from the one produced by the standard GATK pipeline using GenomicsDBImport and GenotypeGVCFs?
- Are these methods still considered experimental at this point?
- Can the resulting sparse matrix be used directly for QC and analyses, or only the VCF exported from it?
To create gnomAD v3, the first version of this genome release, we took advantage of a new sparse (but lossless) data format developed by Chris Vittal and Cotton Seed on the Hail team to store individual genotypes in a fraction of the space required by traditional VCFs. In a previous blog post describing this innovation, we noted that one advantage of this new format was the possibility of appending new data to existing callsets without needing to re-process samples already joint called as part of prior gnomAD releases—effectively solving the “N+1” joint calling problem.
For gnomAD v3.1, we made good on this promise, adding 4,598 new genomes in gVCF form to the already extant, joint-called gnomAD v3 callset stored in the sparse Hail Matrix Table format. This is, to our knowledge, the first time that this procedure has been done. Chris Vittal added the new genomes for us in six hours—shaving off almost a week of compute time (or several million core hours) that would have been required if we had created the callset from scratch.
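To make the "N+1" idea concrete, here is a minimal toy sketch in plain Python. It is not Hail's actual data format or API: the `SparseCallset` class, its methods, and the sample names are all hypothetical, and real gVCF merging also handles alleles, quality fields, and much more. The sketch only shows why a sparse layout, where each sample keeps its own reference blocks and variant records, allows a new sample to be appended without rewriting the samples already stored.

```python
from bisect import bisect_right


class SparseCallset:
    """Toy model of a sparse callset (hypothetical, not Hail's format).

    Each sample stores only its own records: (start, end, genotype).
    A reference block covers a whole interval with a single record.
    """

    def __init__(self):
        self.records = {}  # sample name -> sorted list of (start, end, genotype)

    def append_sample(self, sample, records):
        # The "N+1" step: only the new sample's records are written.
        if sample in self.records:
            raise ValueError(f"sample {sample!r} is already in the callset")
        self.records[sample] = sorted(records)

    def genotype(self, sample, pos):
        # Find the last record starting at or before pos, then check it covers pos.
        recs = self.records[sample]
        i = bisect_right(recs, (pos, float("inf"), "")) - 1
        if i >= 0:
            start, end, gt = recs[i]
            if start <= pos <= end:
                return gt
        return None  # position not covered: no call

    def densify(self, positions):
        # Build a dense site-by-sample view on demand, e.g. for export.
        return {pos: {s: self.genotype(s, pos) for s in self.records}
                for pos in positions}


callset = SparseCallset()
callset.append_sample("S1", [(1, 99, "0/0"), (100, 100, "0/1"), (101, 200, "0/0")])
callset.append_sample("S2", [(1, 200, "0/0")])
before = {s: list(r) for s, r in callset.records.items()}

# Append a new sample that carries a variant at a site no earlier sample has.
callset.append_sample("S3", [(1, 149, "0/0"), (150, 150, "1/1"), (151, 200, "0/0")])

# Existing samples were not touched, yet their genotypes at the new site
# can still be recovered from their reference blocks.
dense = callset.densify([100, 150])
print(dense[150])
```

In this toy model, the genotypes of `S1` and `S2` at position 150 come from reference blocks they already had, which is the property that makes appending cheap compared with rebuilding a dense site-by-sample matrix.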