Combining gVCFs into sparse matrix like gnomAD

For gnomAD v3.1, they announced the use of Hail to solve the joint calling problem. Is there a tutorial describing the steps used, in particular the append step?

  1. Are there differences between the resulting VCF and one produced by the standard GATK pipeline using GenomicsDBImport and GenotypeGVCFs?
  2. Are these methods still considered experimental at this point?
  3. Can the resulting sparse matrix be used for QC/analyses, or only the resulting VCF?

To create gnomAD v3, the first version of this genome release, we took advantage of a new sparse (but lossless) data format developed by Chris Vittal and Cotton Seed on the Hail team to store individual genotypes in a fraction of the space required by traditional VCFs. In a previous blog post describing this innovation, we noted that one advantage of this new format was the possibility of appending new data to existing callsets without needing to re-process samples already joint called as part of prior gnomAD releases—effectively solving the “N+1” joint calling problem.

For gnomAD v3.1, we made good on this promise, adding 4,598 new genomes in gVCF form to the already extant, joint-called gnomAD v3 callset stored in the sparse Hail Matrix Table format. This is, to our knowledge, the first time that this procedure has been done. Chris Vittal added the new genomes for us in six hours—shaving off almost a week of compute time (or several million core hours) that would have been required if we had created the callset from scratch.


Thanks for the questions! All of our combiner documentation is here. I hope it can provide a detailed overview.

There is not currently a tutorial on the append step. We are working on interface changes to the combiner CLI tool and the run_combiner function it wraps to make it easier to use and to support the append use case right out of the box.
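In the meantime, a rough sketch of calling run_combiner directly is below; the bucket paths are placeholders for your own gVCFs and output locations, and the exact keyword arguments may differ slightly between Hail versions.

    import hail as hl

    hl.init(log='combiner.log')

    # Placeholder paths; one gVCF per sample.
    gvcfs = ['gs://my-bucket/sample1.g.vcf.gz',
             'gs://my-bucket/sample2.g.vcf.gz']

    hl.experimental.run_combiner(
        gvcfs,
        out_file='gs://my-bucket/dataset.mt',   # sparse matrix table output
        tmp_path='gs://my-bucket/tmp/',         # intermediate results land here
        reference_genome='GRCh38',
        use_genome_default_intervals=True,      # partitioning suited to whole genomes
    )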

There are two parts here: what the combiner outputs, and how it differs from GATK's current pipeline. The short answer is yes, there are differences.

The VCF combiner does not produce VCFs, and its output, if exported directly, would not be a valid VCF. Instead it produces a Hail-native file called a matrix table, in a layout we've called the sparse matrix table. More on sparse matrix tables is available in our documentation here.
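As a minimal sketch (the path below is a placeholder), the combiner's output is read back like any other matrix table and can be converted into a dense, project-VCF-like representation with the experimental utilities:

    import hail as hl

    mt = hl.read_matrix_table('gs://my-bucket/dataset.mt')  # placeholder path

    # Sparse entries store local alleles (LGT/LA); split multi-allelics and
    # convert local fields to global GT before densifying.
    mt = hl.experimental.sparse_split_multi(mt)

    # densify fills in reference genotypes from the reference blocks,
    # producing a dense, project-VCF-like matrix table.
    dense_mt = hl.experimental.densify(mt)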

One of the goals of the VCF combiner is to faithfully preserve the input gVCFs in the output, save for some information we drop surrounding the <NON_REF> allele. It is possible to almost reconstruct any input gVCF from the sparse matrix table. No re-calibration of variants is performed; the alleles are what was in the gVCFs (padded if necessary to properly combine SNPs and indels), with no limit on the number of alleles for a particular variant.

Yes, they are. While we currently expect any changes to the details of the sparse matrix table data format to be backwards compatible, there may be changes that improve usability significantly enough to warrant changing the sparse matrix table format. Additionally, the interfaces, while usable, need to be improved before we commit to Hail's strong backwards compatibility promise for our stable interfaces.

Certainly! QC activities for gnomAD v3 and v3.1 took place predominantly on the sparse matrix table.
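For instance (a sketch only; many QC steps can also be expressed directly on the sparse representation), standard Hail QC functions apply to the densified matrix table from the earlier sketch, without ever exporting a VCF:

    import hail as hl

    # Continuing from dense_mt in the earlier sketch.
    dense_mt = hl.sample_qc(dense_mt)    # per-sample metrics (call rate, n_snp, ...)
    dense_mt = hl.variant_qc(dense_mt)   # per-variant metrics (AF, call rate, ...)
    dense_mt.cols().select('sample_qc').show()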

Thanks again!
Chris Vittal

I am testing the gVCF combiner with 2002 samples × 22 gVCFs per sample. It is running slower than I expected, given this is much smaller than the gnomAD sample set.
The job is running with 40 executors with 5 cores each. It is only part way through and is still running after 2 days (see log below). Are there any suggestions for speeding up the gVCF combiner?

Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.46-6ef64c08b000
LOGGING: writing to /project/kageproj/combiner.log
2021-05-03 12:13:34 Hail: INFO: GVCF combiner plan:
    Branch factor: 100
    Batch size: 100
    Combining 44044 input files in 3 phases with 7 total jobs.
        Phase 1: 5 jobs corresponding to 441 intermediate output files.
        Phase 2: 1 job corresponding to 5 intermediate output files.
        Phase 3: 1 job corresponding to 1 final output file.

2021-05-03 12:13:34 Hail: INFO: Starting phase 1/3, merging 44044 input GVCFs in 5 jobs.
2021-05-03 12:13:34 Hail: INFO: Starting phase 1/3, job 1/5 to create 100 merged files, corresponding to ~7.6% of total I/O.
[Stage 100:===============================>                 (1606 + 200) / 2500]

Are these split by chromosome? That’s not a use case the Hail combiner supports currently, so the combiner is trying to merge 44044 samples with very strange missingness patterns.

Other than this, there have been performance improvements to the combiner since 0.2.46, so it's worth upgrading (although we've discovered a memory leak in the latest release, 0.2.66; wait until 0.2.67 to upgrade).

We upgraded Hail to 0.2.67 and also merged the per-chromosome files into autosomal g.vcf.gz files with the GATK gather tool. It is running much faster. Unexpectedly, though, we are finding that the g.vcf.gz files being merged are now getting corrupted. This seems to be happening at the step when the matrix table is being written out.

Any ideas on this? Could it be a problem with the new Hail install or the combiner script? During the last run, the g.vcf.gz files were made read-only and they were still corrupted.

This is quite surprising: none of the Hail code dealing with gVCFs writes; it only reads. What kind of corruption are you seeing? The tabix indices no longer work? Data overwritten?

If the files are read-only, there is nothing at the Hail process level that can modify them. Possibly a file system problem could cause this.

The issue was related to the I/O overwhelming some drives (a Hadoop config issue), which takes them offline when a drive times out. When 3 are taken offline, it generates a warning that the filesystem is corrupted since it can't read some partitions.