Hail sample_qc results

Hi,
I’m new to hail.
I have a single sample VCF and tried generating sample_qc using it. However the results are inconsistent
code I’m running
mt = hl.sample_qc(mt)
mt.select_cols(mt.sample_qc.n_snp,
mt.sample_qc.n_deletion,
mt.sample_qc.call_rate,
mt.sample_qc.dp_stats.mean,
mt.sample_qc.n_insertion,
mt.sample_qc.r_insertion_deletion,
mt.sample_qc.r_het_hom_var,
mt.sample_qc.r_ti_tv)
.cols().show()

The number of n_snps it returns is 197k while the total number of variants are only 158k. Similar is the case for insertions and deletions, where it gives a count ~3k higher than the actual number present.

Could someone help me understand this.

Thanks,
Maleeha

to clarify, you’re saying that the number of records in the VCF is 158K?

It’s possible for n_snp or n_insertion to be higher than the number of rows, because these are defined (per the docs) as the number of alternate alleles in these categories. This should be rare, though.

What is the source of the VCF? Sequencing/genotyping?

The number of records in the VCF are 158k (which would include indels and snps) . While the number n_snp generated is 197k.

summarize_variants does generates the true number of snps present.
This is whole exome sequenced data

If you compute:

mt = mt.annotate_cols_(
  call_distribution = hl.agg.counter(mt.GT),
)
mt.cols().show()

What doesit print?

It essentially prints the same values for n_snp, insertion and deletions…etc.

what about the call_distribution value?

Instead, you can do:

mt = mt.annotate_cols_(
  call_distribution = hl.agg.counter(mt.GT),
)
print(mt.call_distribution.collect())

[frozendict({Call(alleles=[0, 1], phased=False): 86587, Call(alleles=[1, 1], phased=False): 66925, Call(alleles=[1, 2], phased=False): 397})]

It looks like you have no homozygous reference calls – is that what you expect?

Actually, in a single-sample project VCF, this might be what we expect…

can you paste the same qc results?

Yes, because a VCF for whole exome sequencing only contains variants, hom ref calls are removed.

2022-09-05 09:16:28 Hail: WARN: cols(): Resulting column table is sorted by ‘col_key’. To preserve matrix table column order, first unkey columns with ‘key_cols_by()’ 2022-09-05 09:43:49 Hail: INFO: Coerced sorted dataset 2022-09-05 09:44:10 Hail: INFO: Coerced sorted dataset 2022-09-05 09:44:11 Hail: INFO: Coerced sorted dataset

s n_snp n_deletion call_rate mean n_insertion r_insertion_deletion r_het_hom_var r_ti_tv
str int64 int64 float64 float64 int64 float64 float64 float64
“N” 191833 15125 1.00e+00 2.76e+01 13240 8.75e-01 1.30e+00 2.28e+00

And this is subset of summarize_variants:
2022-09-05 09:44:36 Hail: INFO: Coerced sorted dataset

Variant summary:

  • Total variants: 153909
  • Alleles per variant:
Number of alleles Count
2 153512
3 397
  • Counts by allele type:
Allele type Count
SNP 133411
Deletion 10960
Insertion 9201
MNP 734
  • Transitions/Transversions:
Metric Value
Transitions 92822
Transversions 40589
Ratio 2.29

This all looks good to me.

That’s the explanation for why the n_snp count is higher than the number of loci in your dataset. Lots of sites with homozygous alternate (1/1) calls drives this up.

So it basically counts each homozygous snp as 2? Alright, thanks a lot for clearing this out. So for the kind of output I need I should use summarize variants or something similar.

Thank you.

Maleeha