Hi everyone, I have a whole-exome sequencing dataset that was processed with GATK before being imported into Hail. After restricting to exome intervals and running sample_qc, the average n_snp per sample is roughly 35,000.
However, according to the metrics obtained from the VCF using this tool (https://gatk.broadinstitute.org/hc/en-us/articles/360037057132-CollectVariantCallingMetrics-Picard-), the expected number of coding variants is in the range of 20,000 - 25,000.
I have already ruled out the contribution of splitting multi-allelic sites. Would the devs have any comments on the likely reason for the discrepancy? Thank you!