Hail sample_qc results

MaleehaZaman · August 30, 2022, 6:47am

Hi,
I’m new to hail.
I have a single sample VCF and tried generating sample_qc using it. However the results are inconsistent
code I’m running
mt = hl.sample_qc(mt)
mt.select_cols(mt.sample_qc.n_snp,
mt.sample_qc.n_deletion,
mt.sample_qc.call_rate,
mt.sample_qc.dp_stats.mean,
mt.sample_qc.n_insertion,
mt.sample_qc.r_insertion_deletion,
mt.sample_qc.r_het_hom_var,
mt.sample_qc.r_ti_tv)
.cols().show()

The number of n_snps it returns is 197k while the total number of variants are only 158k. Similar is the case for insertions and deletions, where it gives a count ~3k higher than the actual number present.

Could someone help me understand this.

Thanks,
Maleeha

tpoterba · August 30, 2022, 1:02pm

to clarify, you’re saying that the number of records in the VCF is 158K?

It’s possible for n_snp or n_insertion to be higher than the number of rows, because these are defined (per the docs) as the number of alternate alleles in these categories. This should be rare, though.

What is the source of the VCF? Sequencing/genotyping?

MaleehaZaman · August 30, 2022, 1:07pm

The number of records in the VCF are 158k (which would include indels and snps) . While the number n_snp generated is 197k.

summarize_variants does generates the true number of snps present.
This is whole exome sequenced data

tpoterba · August 30, 2022, 1:09pm

If you compute:

mt = mt.annotate_cols_(
  call_distribution = hl.agg.counter(mt.GT),
)
mt.cols().show()

What doesit print?

MaleehaZaman · August 30, 2022, 1:23pm

It essentially prints the same values for n_snp, insertion and deletions…etc.

tpoterba · August 30, 2022, 1:24pm

what about the call_distribution value?

tpoterba · August 30, 2022, 1:25pm

Instead, you can do:

mt = mt.annotate_cols_(
  call_distribution = hl.agg.counter(mt.GT),
)
print(mt.call_distribution.collect())

MaleehaZaman · August 31, 2022, 7:15am

[frozendict({Call(alleles=[0, 1], phased=False): 86587, Call(alleles=[1, 1], phased=False): 66925, Call(alleles=[1, 2], phased=False): 397})]

tpoterba · August 31, 2022, 12:35pm

It looks like you have no homozygous reference calls – is that what you expect?

tpoterba · August 31, 2022, 12:36pm

Actually, in a single-sample project VCF, this might be what we expect…

tpoterba · August 31, 2022, 12:36pm

can you paste the same qc results?

MaleehaZaman · September 5, 2022, 6:48am

Yes, because a VCF for whole exome sequencing only contains variants, hom ref calls are removed.

MaleehaZaman · September 5, 2022, 9:45am

2022-09-05 09:16:28 Hail: WARN: cols(): Resulting column table is sorted by ‘col_key’. To preserve matrix table column order, first unkey columns with ‘key_cols_by()’ 2022-09-05 09:43:49 Hail: INFO: Coerced sorted dataset 2022-09-05 09:44:10 Hail: INFO: Coerced sorted dataset 2022-09-05 09:44:11 Hail: INFO: Coerced sorted dataset


s	n_snp	n_deletion	call_rate	mean	n_insertion	r_insertion_deletion	r_het_hom_var	r_ti_tv
str	int64	int64	float64	float64	int64	float64	float64	float64
“N”	191833	15125	1.00e+00	2.76e+01	13240	8.75e-01	1.30e+00	2.28e+00

MaleehaZaman · September 5, 2022, 9:48am

And this is subset of summarize_variants:
2022-09-05 09:44:36 Hail: INFO: Coerced sorted dataset

Variant summary:

Total variants: 153909
Alleles per variant:

Number of alleles	Count
2	153512
3	397

Counts by allele type:

Allele type	Count
SNP	133411
Deletion	10960
Insertion	9201
MNP	734

Transitions/Transversions:

Metric	Value
Transitions	92822
Transversions	40589
Ratio	2.29

tpoterba · September 6, 2022, 11:43am

This all looks good to me.

That’s the explanation for why the n_snp count is higher than the number of loci in your dataset. Lots of sites with homozygous alternate (1/1) calls drives this up.

MaleehaZaman · September 7, 2022, 5:18am

So it basically counts each homozygous snp as 2? Alright, thanks a lot for clearing this out. So for the kind of output I need I should use summarize variants or something similar.

Thank you.

Maleeha

Topic		Replies	Views
Inconsistent per sample QC result Hail Query & hailctl	3	397	March 15, 2022
Mismatch between Hail and GATK/Picard counting? Hail Query & hailctl	1	263	April 25, 2022
Exporting Hail MT to VCF - Missing Genotypes Hail Query & hailctl	11	252	May 8, 2024
Error in variant/sample QC. Malformed VCF? Hail Query & hailctl	2	375	March 23, 2021
Inconsistent sample qc results Hail Query & hailctl	4	420	April 22, 2020

Hail sample_qc results

Related topics