Hi,
I’m new to Hail.
I have a single-sample VCF and tried generating sample QC metrics from it with sample_qc. However, the results look inconsistent.
This is the code I’m running:
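import hail as hl

# mt is the MatrixTable imported from the single-sample VCF
mt = hl.sample_qc(mt)
# the closing parenthesis and .cols() stay on one line so the chain parses
mt.select_cols(
    mt.sample_qc.n_snp,
    mt.sample_qc.n_deletion,
    mt.sample_qc.call_rate,
    mt.sample_qc.dp_stats.mean,
    mt.sample_qc.n_insertion,
    mt.sample_qc.r_insertion_deletion,
    mt.sample_qc.r_het_hom_var,
    mt.sample_qc.r_ti_tv,
).cols().show()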
The n_snp value it returns is 197k, while the total number of variants is only 158k. The same happens for insertions and deletions, where the count is ~3k higher than the actual number present.
To clarify, you’re saying that the number of records in the VCF is 158K?
It’s possible for n_snp or n_insertion to be higher than the number of rows, because these are defined (per the docs) as the number of alternate alleles in these categories. This should be rare, though.
What is the source of the VCF? Sequencing/genotyping?
That’s the explanation for why the n_snp count is higher than the number of loci in your dataset. Lots of sites with homozygous alternate (1/1) calls drive this up, because each such call carries two alternate alleles.
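To make the difference concrete, here is a minimal pure-Python sketch (not Hail code) that counts the same toy genotypes both ways: once per site and once per alternate allele. The rows, the `classify` helper, and the length-based allele classification are hypothetical simplifications for illustration; Hail's own SNP/indel classification is more careful.

```python
from collections import Counter

# Toy single-sample data, one tuple per VCF row: (ref, alt, genotype).
# These values are made up for illustration.
rows = [
    ("A", "G", (0, 1)),    # heterozygous SNP
    ("C", "T", (1, 1)),    # homozygous-alternate SNP
    ("G", "A", (1, 1)),    # homozygous-alternate SNP
    ("T", "TA", (0, 1)),   # heterozygous insertion
    ("CAG", "C", (1, 1)),  # homozygous-alternate deletion
]


def classify(ref, alt):
    """Simplified allele classification based only on allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "snp"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "other"


def count_sites_and_alleles(rows):
    """Return (per-site counts, per-alternate-allele counts) by category."""
    sites = Counter()
    alleles = Counter()
    for ref, alt, genotype in rows:
        n_alt = sum(1 for allele in genotype if allele != 0)
        if n_alt == 0:
            continue  # homozygous reference: nothing to count
        kind = classify(ref, alt)
        sites[kind] += 1        # one per row, regardless of zygosity
        alleles[kind] += n_alt  # 1 for a het call, 2 for a 1/1 call
    return sites, alleles


site_counts, allele_counts = count_sites_and_alleles(rows)
print("per site:  ", dict(site_counts))
print("per allele:", dict(allele_counts))
```

With these five rows there are 3 SNP sites but 5 SNP alternate alleles, which is the same kind of gap as 158k records versus an n_snp of 197k.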
So it basically counts each homozygous SNP as 2? Alright, thanks a lot for clearing this up. So for the kind of output I need, I should use summarize_variants or something similar.