Appending/merging/combining multiple VCF files using Hail

Hello Everyone,

I have a set of VCF files grouped by ethnicity, such as American Indian, Chinese, European, etc.

Each ethnicity has 100+ files.

So far, I have computed the variant QC metrics such as call_rate and n_het for a single file, as shown in the Hail tutorial (see the image below).

image is here

However, I would now like to combine the files into one dataset per ethnicity and then compute the variant QC metrics on it.

I have already looked at this post and this post, but I don't think they address my question.

How can I do this across all files under a specific ethnicity?

Can anyone help me with this?

Is there a Hail way to do this, or a Python or R approach that veterans here are aware of?

You’re generally going to have a better time if you combine everything into one dataset and compute statistics on subsets of that dataset, rather than iterating through many smaller datasets. We need more information about these input VCFs, though. Are these project VCFs (with FORMAT fields such as GT, GQ, etc.) for a group of samples from sequencing data? Those cannot be losslessly combined: a site might appear in one VCF but not another, and for the samples in the VCF that lacks it there is no way to tell a homozygous-reference call from a missing one.
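To illustrate the "one dataset, statistics on subsets" idea, here is a minimal sketch. It assumes a combined MatrixTable `mt` already exists, that its column key is the sample ID field `s` (what `hl.import_vcf` produces), and that you can supply a mapping from ethnicity to sample IDs yourself. `ethnicity_by_sample` and `variant_qc_by_ethnicity` are hypothetical helper names, not Hail functions.

```python
def ethnicity_by_sample(samples_by_ethnicity):
    """Invert {ethnicity: [sample IDs]} into {sample ID: ethnicity}."""
    lookup = {}
    for ethnicity, samples in samples_by_ethnicity.items():
        for sample in samples:
            if sample in lookup:
                raise ValueError(f"sample {sample!r} is listed under two ethnicities")
            lookup[sample] = ethnicity
    return lookup


def variant_qc_by_ethnicity(mt, sample_to_ethnicity):
    """Run hl.variant_qc once per ethnicity on column subsets of one MatrixTable.

    Assumes `mt` has the column key field `s` and a GT entry field.
    """
    import hail as hl  # imported lazily so the helper above works without Hail

    labels = hl.literal(sample_to_ethnicity)  # dict<str, str> expression
    mt = mt.annotate_cols(ethnicity=labels.get(mt.s))

    results = {}
    for ethnicity in sorted(set(sample_to_ethnicity.values())):
        subset = mt.filter_cols(mt.ethnicity == ethnicity)
        # Keep only the row key (locus, alleles) and the variant_qc struct.
        results[ethnicity] = hl.variant_qc(subset).rows().select("variant_qc")
    return results
```

Each value in the returned dictionary is a Hail Table keyed by variant, so fields such as `variant_qc.call_rate` and `variant_qc.n_het` can be inspected per ethnicity.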

If your VCFs are genotype data, it’s probably possible to combine them, since those files have the same set of variants.
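For that case, a rough sketch of combining the files could look like the following. It assumes each file holds different samples at the same variants, that the files are block-gzipped (hence `force_bgz=True`; plain gzip would need different handling), and that files sit in one directory per ethnicity, which is a hypothetical layout. `group_vcfs_by_ethnicity` and `combine_samples` are illustrative names, not part of Hail.

```python
from collections import defaultdict
from functools import reduce
from pathlib import Path


def group_vcfs_by_ethnicity(paths):
    """Group VCF paths by parent directory name (assumed to be the ethnicity)."""
    groups = defaultdict(list)
    for path in paths:
        groups[Path(path).parent.name].append(str(path))
    return {ethnicity: sorted(files) for ethnicity, files in groups.items()}


def combine_samples(vcf_paths, reference_genome="GRCh37"):
    """Import each VCF, join them sample-wise, and annotate variant QC."""
    import hail as hl  # imported lazily so the path helper works without Hail

    mts = [
        hl.import_vcf(path, force_bgz=True, reference_genome=reference_genome)
        for path in vcf_paths
    ]
    # union_cols joins on the row key (locus, alleles). By default it is an
    # inner join, so only variants present in every file are kept.
    combined = reduce(lambda left, right: left.union_cols(right), mts)
    return hl.variant_qc(combined)
```

`union_cols` expects the inputs to share the same row key, column, and entry schemas, so it is worth checking those first. Chaining 100+ joins may be slow, so writing the combined MatrixTable to disk once and reading it back is probably worthwhile.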

Hi @tpoterba

Yes, my VCF files contain FORMAT fields such as GT, GP, etc.

I also ran mt.describe() on one file from each ethnicity, and they all have the same schema, as shown below.
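Comparing `describe()` output by eye gets tedious with 100+ files, so the same check can be sketched programmatically. This assumes each file has been imported as a Hail MatrixTable; `schema_signature` and `first_mismatch` are hypothetical helper names.

```python
def schema_signature(mt):
    """Summarise the schemas that matter for combining: row key, columns, entries."""
    return (str(mt.row_key.dtype), str(mt.col.dtype), str(mt.entry.dtype))


def first_mismatch(signatures):
    """Return the index of the first signature that differs from the first one.

    Returns None when all signatures agree (or the list is empty).
    """
    for index, signature in enumerate(signatures):
        if signature != signatures[0]:
            return index
    return None
```

For example, `first_mismatch([schema_signature(mt) for mt in mts])` returns `None` when every imported file has the same row key, column, and entry types.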

image

Does Hail have a function to combine all these files? I see the VCF combiner, but I believe it is different from what I am looking for: it deals with gVCFs, which I think are different from my files, which have the vcf.gz extension.