Compute QC metrics for each variant and position

Hello Everyone,

I am trying to do something like below. This is how my mental model looks like.

The input files have other usual VCF columns such as INFO and FORMAT fields as well. It’s not shown here. The numbers shown in right hand side output table variant_qc metrics are dummy. But I expect the output to be in that format.

I had the below approach to solve this

a) Read one Input file first
b) Group by [variant, chr position] and compute the metrics such as n_het, n_home_ref,n_hom_alt (no need of sample ids as we group by variant and chr position). Should I just use hl.variant_qc method here?
c) Store the output in a mt
d) Repeat steps a and b for file 2
e) combine the output of d with step c.

I have few questions

a) Is it possible to achieve my expected output using hail? How can I do this?

b) How can I compute those measures by grouping variants and chr position? What aggregate function should I use to get those measures? I understand I can follow this link to use group by rows (variant and chr position)?

c) I tried hl.variant_qc method on my VCF file_1 which gave me the below output for count command

(2954429, 1058)

But I also found the same count initially when I read the VCF file

(2954429, 1058)

d) Why isn’t there any reduction in size of the matrix table? I was expecting variant_qc output stored in a mt to be less in size when compared to the input mt. Wouldn’t variant_qc generate statistics based on each variant and chromosome position? I didn’t see the sample Ids in the variant_qc, so I was of the understanding that it is grouped by variant and chromosome position. May I know how does that work?

If these VCFs are both for the same samples, but just represent different chromosomes or something, you can do:

path_1 = "path/to/file1.vcf"
path_2 = "path/to/file2.vcf"
mt = hl.import_vcf([path_1, path_2])

and it’ll import both at once.

You shouldn’t need to use grouping methods, as VCFs don’t look the way you described. A matrix table already has one row per locus/allele pair. Each sample gets its own column.

count just tells you the number of rows in a MatrixTable. variant_qc won’t change this.

1 Like

@johnc1231 - So, importing multiple files is like combining all VCF files together?

Yeah. You should read this if you haven’t: https://samtools.github.io/hts-specs/VCFv4.2.pdf

VCF’s don’t look the way you drew them in your Excel table. They have all the samples in one row.

Hi @johnc1231,

Actually what I am trying to achieve is like as shown below

I would like to group the records by variant and locus and compute measures such as n_het, n_hom_ref etc. So, I tried the below

result_mt = (mt.group_rows_by(mt.locus,mt.alleles)
               .aggregate(n_hom_ref=hl.agg.count_where(mt.GT.is_hom_ref()),
                            n_het = hl.agg.count_where(mt['GT'].is_het()))) 

However, it produced an incorrect output (due to my code) like below (which isn’t expected). I also tried using agg.counter instead of count.where() but still incorrect output.

I expect my output to be like this (at the population level). I don’t want sample level data.

Locus Alleles n_hom_ref n_het
2:1234 [A,C] 1012 345
3:1234 [T,TA] 980 654
4:1234 [C,G] 321 789
5:1234 [T,G] 456 432

The above metric numbers shown in table are just dummy and doesn’t carry any meaning

May I know whats the how can I transform this the right way?

I just found out it can be done using annotate_rows. Hence wrote the below code to get the output shown above. If its incorrect or any other better way to write this, please do let me know. I am posting it here for the benefit of others

MT2 = mt.annotate_rows(n_hom_ref=hl.agg.count_where(mt.GT.is_hom_ref()),
                       n_het = hl.agg.count_where(mt.GT.is_het()),
                       n_hom_alt = hl.agg.count_where(mt.GT.is_hom_var()))