Combining VCFs from gnomAD

Hi,
I’m interested in combining VCFs from gnomAD and was wondering if there’s an efficient way to do this in Hail (for example, parallelizing the operation by chromosome).

Thanks!

Hey @rye335

I need more context about your end goal. Almost all of the gnomAD data should be available internally and externally in native Hail format (see here). What data would be in the VCF you are producing, and for what purpose are you creating it?

We have several gnomAD v4 subsets in .vds format that we want to combine with GVCFs generated at the Broad. Eventually we want to add ~2000 GVCFs and then run VQSR on the new combined dataset (vds1 + vds2 + 2000 GVCFs). Afterwards, we would do downstream QC and rare variant analyses on the dataset.

It is not the best documented, but you'll want to use the VDS combiner. Docs here

You will want to do something like this:

import hail as hl

vdses = [
    # paths to the gnomAD subset VDSes
    ...
]

gvcfs = [
    # paths to your sample GVCFs
    ...
]

combiner = hl.vds.new_combiner(
    output_path='gs://bucket/dataset.vds',       # where the combined VDS will be written
    temp_path='gs://temp-bucket/vds-combiner',   # scratch space for intermediate combiner results
    gvcf_paths=gvcfs,
    vds_paths=vdses,
    use_exome_default_intervals=True,            # default import intervals for exome data
)

combiner.run()

vds = hl.vds.read_vds('gs://bucket/dataset.vds')
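
Once the combiner finishes, a common next step toward the QC you mention is densifying the VDS into a regular MatrixTable. Here is a minimal sketch, assuming the output path from the example above; the specific QC calls (split_multi, to_dense_mt, sample_qc) are illustrative, not a prescribed pipeline:

import hail as hl

# Read back the combined dataset (path from the example above).
vds = hl.vds.read_vds('gs://bucket/dataset.vds')

# Split multi-allelic variants so entries carry a standard GT field.
vds = hl.vds.split_multi(vds)

# Densify into a regular MatrixTable for conventional QC and analysis.
mt = hl.vds.to_dense_mt(vds)

# Example: compute per-sample QC metrics and take a quick look.
mt = hl.sample_qc(mt)
mt.cols().show(5)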