Calculation of mean depth

Hi Hail team,

I used sample_qc.dp_stats.mean to get the mean depth of each sample.

I expected most samples to be at 30-40X, but the values came out at only 20-30X.

Comparing sample by sample, the values from sample_qc (Hail) differ from those from Depth of coverage (GATK).

I wonder why this difference appears. Could you tell me how to get the mean depth from Hail's sample_qc?



I’m assuming you imported a project VCF into a Hail MatrixTable before running hl.sample_qc. In that case the mean DP produced by GATK and by Hail cannot be the same, because a project VCF is a lossy format: it only contains sites where an individual in your dataset has a polymorphism, and it discards information about the loci between those sites.

For each sample, Hail’s dp_stats.mean is the sum of the DP values for the entries observed in your VCF, divided by the number of non-missing DP values. I would expect this to be slightly lower than the true read coverage, especially if your VCF includes low-complexity (telomeric, centromeric) regions. Lots of indel variants appear there, which biases the set of included loci toward low-depth, badly-covered positions.
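To make the definition above concrete, here is a minimal plain-Python sketch of the same arithmetic. It is not Hail's implementation; the function name `mean_depth`, the depth values, and the choice of which loci count as variant sites are all made up for illustration.

```python
from typing import Optional, Sequence


def mean_depth(dp_values: Sequence[Optional[int]]) -> Optional[float]:
    """Sum of non-missing DP values divided by the number of non-missing values."""
    observed = [dp for dp in dp_values if dp is not None]
    if not observed:
        return None  # every DP is missing, so the mean is undefined
    return sum(observed) / len(observed)


# Hypothetical per-locus depths for one sample over a 10-locus region,
# roughly what a per-base coverage tool would see.
all_loci_dp = [35, 36, 34, 12, 35, 37, 10, 36, 35, 34]

# Hypothetical: only loci 1, 3 and 6 are variant sites, so only those
# appear in the project VCF. Two of them are poorly covered positions.
variant_site_indices = [1, 3, 6]
vcf_dp = [all_loci_dp[i] for i in variant_site_indices]

# An entry with a missing DP is skipped rather than counted as zero.
vcf_dp_with_missing = vcf_dp + [None]

true_mean = mean_depth(all_loci_dp)
vcf_mean = mean_depth(vcf_dp_with_missing)

print(f"mean over all loci:      {true_mean:.1f}")
print(f"mean over VCF sites only: {vcf_mean:.1f}")
```

With these invented numbers the mean over VCF sites is lower than the mean over all loci, because the low-depth positions are over-represented among the variant sites. In Hail itself, the same per-sample quantity should also be obtainable with something like `mt.annotate_cols(mean_dp=hl.agg.mean(mt.DP))`, assuming your entries carry a `DP` field.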

Thank you for the explanation! I understood it well!