Mismatch between Hail and GATK/Picard counting?

Ken · April 25, 2022, 2:58pm

Hi everyone, I have a whole-exome sequencing dataset that was processed with GATK before being imported into Hail. After restricting to exome intervals and running sample_qc, the average n_snp per sample is roughly 35,000.

However, the metrics computed from the VCF with Picard's CollectVariantCallingMetrics (https://gatk.broadinstitute.org/hc/en-us/articles/360037057132-CollectVariantCallingMetrics-Picard-) put the expected number of coding variants in the range of 20,000 - 25,000.

I have already ruled out the contribution of splitting multi-allelic sites. Would the devs have any comments on the likely reason for the discrepancy? Thank you!

tpoterba · April 25, 2022, 3:16pm

Could you point me to the Picard documentation where they define the semantic meaning of this field?

In Hail, the n_snp produced by sample_qc is defined as “the number of SNP alternate alleles” per sample. This means a 1/1 genotype for a SNP alternate allele counts as 2.
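
To make the difference concrete, here is a minimal sketch in plain Python (not Hail code) that counts the same toy genotypes both ways. The genotype list is made up for illustration, and whether the Picard metric counts per site rather than per allele is an assumption that still needs to be checked against their documentation.

```python
# Toy genotypes for one sample at biallelic SNP sites, written as
# (allele1, allele2) where 0 = reference and 1 = the SNP alternate allele.
# These values are invented for illustration only.
genotypes = [(0, 1), (1, 1), (0, 0), (1, 1), (0, 1)]

# Allele-based count, matching the n_snp definition described above:
# every non-reference allele contributes 1, so a 1/1 call contributes 2.
n_snp_alleles = sum(1 for gt in genotypes for allele in gt if allele > 0)

# Site-based count (assumed here to be what a per-variant metric reports):
# every non-reference genotype contributes 1, regardless of zygosity.
n_snp_sites = sum(1 for gt in genotypes if any(allele > 0 for allele in gt))

print(n_snp_alleles, n_snp_sites)  # prints "6 4"
```

If the per-site number is what you want inside Hail, something along the lines of `hl.agg.count_where(mt.GT.is_non_ref() & hl.is_snp(mt.alleles[0], mt.alleles[1]))` in an `annotate_cols` on a split (biallelic) MatrixTable should give it, though I have not run that against your data.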
