Compute QC metrics for each variant and position

SSMK · September 10, 2020, 11:39am

Hello Everyone,

I am trying to do something like below. This is how my mental model looks like.

The input files have other usual VCF columns such as INFO and FORMAT fields as well. It’s not shown here. The numbers shown in right hand side output table variant_qc metrics are dummy. But I expect the output to be in that format.

I had the below approach to solve this

a) Read one Input file first
b) Group by [variant, chr position] and compute the metrics such as n_het, n_home_ref,n_hom_alt (no need of sample ids as we group by variant and chr position). Should I just use hl.variant_qc method here?
c) Store the output in a mt
d) Repeat steps a and b for file 2
e) combine the output of d with step c.

I have few questions

a) Is it possible to achieve my expected output using hail? How can I do this?

b) How can I compute those measures by grouping variants and chr position? What aggregate function should I use to get those measures? I understand I can follow this link to use group by rows (variant and chr position)?

c) I tried hl.variant_qc method on my VCF file_1 which gave me the below output for count command

(2954429, 1058)

But I also found the same count initially when I read the VCF file

(2954429, 1058)

d) Why isn’t there any reduction in size of the matrix table? I was expecting variant_qc output stored in a mt to be less in size when compared to the input mt. Wouldn’t variant_qc generate statistics based on each variant and chromosome position? I didn’t see the sample Ids in the variant_qc, so I was of the understanding that it is grouped by variant and chromosome position. May I know how does that work?

johnc1231 · September 10, 2020, 1:55pm

If these VCFs are both for the same samples, but just represent different chromosomes or something, you can do:

path_1 = "path/to/file1.vcf"
path_2 = "path/to/file2.vcf"
mt = hl.import_vcf([path_1, path_2])

and it’ll import both at once.

You shouldn’t need to use grouping methods, as VCFs don’t look the way you described. A matrix table already has one row per locus/allele pair. Each sample gets its own column.

count just tells you the number of rows in a MatrixTable. variant_qc won’t change this.

SSMK · September 10, 2020, 1:59pm

@johnc1231 - So, importing multiple files is like combining all VCF files together?

johnc1231 · September 10, 2020, 2:01pm

Yeah. You should read this if you haven’t: https://samtools.github.io/hts-specs/VCFv4.2.pdf

VCF’s don’t look the way you drew them in your Excel table. They have all the samples in one row.

SSMK · September 11, 2020, 10:21am

Hi @johnc1231,

Actually what I am trying to achieve is like as shown below

I would like to group the records by variant and locus and compute measures such as n_het, n_hom_ref etc. So, I tried the below

result_mt = (mt.group_rows_by(mt.locus,mt.alleles)
               .aggregate(n_hom_ref=hl.agg.count_where(mt.GT.is_hom_ref()),
                            n_het = hl.agg.count_where(mt['GT'].is_het())))

However, it produced an incorrect output (due to my code) like below (which isn’t expected). I also tried using agg.counter instead of count.where() but still incorrect output.

I expect my output to be like this (at the population level). I don’t want sample level data.

Locus	Alleles	n_hom_ref	n_het
2:1234	[A,C]	1012	345
3:1234	[T,TA]	980	654
4:1234	[C,G]	321	789
5:1234	[T,G]	456	432

The above metric numbers shown in table are just dummy and doesn’t carry any meaning

May I know whats the how can I transform this the right way?

SSMK · September 11, 2020, 3:21pm

I just found out it can be done using annotate_rows. Hence wrote the below code to get the output shown above. If its incorrect or any other better way to write this, please do let me know. I am posting it here for the benefit of others

MT2 = mt.annotate_rows(n_hom_ref=hl.agg.count_where(mt.GT.is_hom_ref()),
                       n_het = hl.agg.count_where(mt.GT.is_het()),
                       n_hom_alt = hl.agg.count_where(mt.GT.is_hom_var()))

Topic		Replies	Views
Hail sample_qc results Hail Query & hailctl	15	451	September 7, 2022
Appending/merging/combining multiple VCF files using Hail Hail Query & hailctl	2	1391	September 8, 2020
Sample_qc help on depth mean and genotype quality Help [0.1]	2	730	February 19, 2018
Sample QC metrics Help [0.1]	6	1325	April 6, 2017
Unable to do sample/variant QC after combining MatrixTable Hail Query & hailctl	11	418	January 8, 2023

Compute QC metrics for each variant and position

Related topics