I have a set of VCF files grouped by ethnicity, such as American Indian, Chinese, European, etc.
Each ethnicity has 100+ files.
Currently, I have computed variant QC metrics such as `n_het` for one file, as shown in the Hail tutorial (see the image below).
image is here
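A minimal sketch of that single-file step, assuming a bgzipped VCF and the GRCh37 reference. The function name `variant_qc_for_file` and the path in the usage note are hypothetical, not taken from the tutorial.

```python
def variant_qc_for_file(vcf_path, reference_genome="GRCh37"):
    """Import one VCF and return its row table with a `variant_qc` struct.

    Assumptions: the file is block-gzipped (hence force_bgz=True) and was
    called against `reference_genome`; adjust both for your data.
    """
    # Imported inside the function so the module can be loaded
    # without starting Hail.
    import hail as hl

    mt = hl.import_vcf(vcf_path, force_bgz=True,
                       reference_genome=reference_genome)
    # variant_qc adds a row field `variant_qc` holding n_het, call_rate, etc.
    mt = hl.variant_qc(mt)
    return mt.rows()
```

For example, `rows = variant_qc_for_file("data/Chinese/sample_001.vcf.gz")` followed by `rows.select(n_het=rows.variant_qc.n_het).show(5)` would print the per-variant heterozygote counts for that one file.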
However, I would now like to have one file for each ethnicity and then compute the same metrics on it.
I have already looked at this post and this post, but I don’t think they address my question.
How can I do this across all the files for a specific ethnicity? Can anyone help me with this?
Is there a Hail way to do this, or an R approach that veterans here are aware of?
You’re generally going to have a better time if you combine everything into one dataset and compute statistics on subsets of that dataset, rather than iterating through many smaller datasets. We need more information about these input VCFs, though. Are these project VCFs (with FORMAT fields such as GT, GQ, etc.) for a group of samples from sequencing data? Those cannot be losslessly combined, because a site might appear in one VCF but not in another.
If your VCFs are genotype data, it’s probably possible to combine them, since those have the same set of variants.
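To make "statistics on subsets of one dataset" concrete, here is a rough sketch. It assumes you already have a single combined MatrixTable `mt` with a `GT` entry field and sample IDs in `mt.s`, plus a plain Python dict listing which samples belong to which ethnicity. The helper names and the `ethnicity` / `n_het_by_ethnicity` field names are made up for illustration.

```python
def sample_to_group(groups):
    """Invert {ethnicity: [sample IDs]} into {sample ID: ethnicity}."""
    mapping = {}
    for group, samples in groups.items():
        for sample in samples:
            if sample in mapping:
                raise ValueError(f"sample {sample!r} is listed in two groups")
            mapping[sample] = group
    return mapping


def annotate_n_het_by_group(mt, groups):
    """Add a per-variant dict {ethnicity: n_het} to a combined MatrixTable.

    Assumes `mt` has a GT entry field and is keyed by sample ID `s`.
    """
    # Imported inside the function so the pure-Python helper above
    # can be used without starting Hail.
    import hail as hl

    lookup = hl.literal(sample_to_group(groups))
    # Samples missing from `groups` get a missing ethnicity.
    mt = mt.annotate_cols(ethnicity=lookup.get(mt.s))
    # One pass over the entries, counting heterozygous calls per ethnicity.
    return mt.annotate_rows(
        n_het_by_ethnicity=hl.agg.group_by(
            mt.ethnicity, hl.agg.count_where(mt.GT.is_het())
        )
    )
```

The point of the design is that the grouping happens inside one aggregation, so the data is scanned once instead of once per ethnicity.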
Yes, my VCF files contain FORMAT fields such as GT, GP, etc.
I also ran `mt.describe()` on one file from each ethnicity, and they all look the same, as shown below.
Does Hail have a function to combine all these files together? I see the VCF combiner, but I believe it is different from what I am looking for (it deals with gVCFs, which I think are different from my files, which have a vcf.gz extension).
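One possible way to do the combining, sketched under strong assumptions: every file holds different samples, all files share the same variants and the same schema, and the files sit in a hypothetical `root/<ethnicity>/*.vcf.gz` layout. It uses `MatrixTable.union_cols`, which by default joins rows with an inner join, so a variant missing from any one file is dropped from the result. Both helper names are invented for this example.

```python
from functools import reduce
from pathlib import Path


def vcfs_by_ethnicity(root):
    """Return {ethnicity: sorted .vcf.gz paths} for root/<ethnicity>/*.vcf.gz."""
    groups = {}
    for path in sorted(Path(root).glob("*/*.vcf.gz")):
        groups.setdefault(path.parent.name, []).append(str(path))
    return groups


def combine_same_variant_vcfs(paths, reference_genome="GRCh37"):
    """Join VCFs column-wise (more samples, same variants) into one MatrixTable.

    Assumes distinct samples per file and identical entry schemas.
    """
    # Imported inside the function so the path helper above
    # can be used without starting Hail.
    import hail as hl

    mts = [
        hl.import_vcf(p, force_bgz=True, reference_genome=reference_genome)
        for p in paths
    ]
    # union_cols keeps the row fields of the left table and inner-joins rows.
    return reduce(lambda left, right: left.union_cols(right), mts)
```

With these, `combine_same_variant_vcfs(vcfs_by_ethnicity("data")["Chinese"])` would give one MatrixTable for that ethnicity, on which `hl.variant_qc` can then be run. Chaining 100+ `union_cols` calls may be slow, so it is worth trying on a handful of files first.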