 # Compute QC metrics for each variant and position

Hello Everyone,

I am trying to do something like below. This is how my mental model looks like.

The input files have other usual VCF columns such as `INFO` and `FORMAT` fields as well. It’s not shown here. The numbers shown in right hand side output table `variant_qc` metrics are dummy. But I expect the output to be in that format.

I had the below approach to solve this

a) Read one Input file first
b) Group by [variant, chr position] and compute the metrics such as `n_het`, `n_home_ref`,`n_hom_alt` (no need of sample ids as we group by variant and chr position). Should I just use `hl.variant_qc` method here?
c) Store the output in a `mt`
d) Repeat steps a and b for `file 2`
e) combine the output of d with step c.

I have few questions

a) Is it possible to achieve my expected output using `hail`? How can I do this?

b) How can I compute those measures by grouping variants and chr position? What aggregate function should I use to get those measures? I understand I can follow this link to use group by rows (variant and chr position)?

c) I tried `hl.variant_qc` method on my VCF file_1 which gave me the below output for count command

`(2954429, 1058)`

But I also found the same count initially when I read the VCF file

`(2954429, 1058)`

d) Why isn’t there any reduction in size of the matrix table? I was expecting `variant_qc` output stored in a `mt` to be less in size when compared to the input `mt`. Wouldn’t variant_qc generate statistics based on each variant and chromosome position? I didn’t see the sample Ids in the variant_qc, so I was of the understanding that it is grouped by variant and chromosome position. May I know how does that work?

If these VCFs are both for the same samples, but just represent different chromosomes or something, you can do:

``````path_1 = "path/to/file1.vcf"
path_2 = "path/to/file2.vcf"
mt = hl.import_vcf([path_1, path_2])
``````

and it’ll import both at once.

You shouldn’t need to use grouping methods, as VCFs don’t look the way you described. A matrix table already has one row per locus/allele pair. Each sample gets its own column.

`count` just tells you the number of rows in a `MatrixTable`. `variant_qc` won’t change this.

1 Like

@johnc1231 - So, importing multiple files is like combining all VCF files together?

Yeah. You should read this if you haven’t: https://samtools.github.io/hts-specs/VCFv4.2.pdf

VCF’s don’t look the way you drew them in your Excel table. They have all the samples in one row.

Hi @johnc1231,

Actually what I am trying to achieve is like as shown below

I would like to group the records by `variant` and `locus` and compute measures such as `n_het`, `n_hom_ref` etc. So, I tried the below

``````result_mt = (mt.group_rows_by(mt.locus,mt.alleles)
.aggregate(n_hom_ref=hl.agg.count_where(mt.GT.is_hom_ref()),
n_het = hl.agg.count_where(mt['GT'].is_het())))
``````

However, it produced an incorrect output (due to my code) like below (which isn’t expected). I also tried using `agg.counter` instead of `count.where()` but still incorrect output.

I expect my output to be like this (at the population level). I don’t want sample level data.

Locus Alleles n_hom_ref n_het
2:1234 [A,C] 1012 345
3:1234 [T,TA] 980 654
4:1234 [C,G] 321 789
5:1234 [T,G] 456 432

The above metric numbers shown in table are just dummy and doesn’t carry any meaning

May I know whats the how can I transform this the right way?

I just found out it can be done using `annotate_rows`. Hence wrote the below code to get the output shown above. If its incorrect or any other better way to write this, please do let me know. I am posting it here for the benefit of others

``````MT2 = mt.annotate_rows(n_hom_ref=hl.agg.count_where(mt.GT.is_hom_ref()),
n_het = hl.agg.count_where(mt.GT.is_het()),
n_hom_alt = hl.agg.count_where(mt.GT.is_hom_var()))``````