The input files also have the other usual VCF columns, such as the INFO and FORMAT fields; they are not shown here. The numbers in the variant_qc metrics output table on the right-hand side are dummy values, but I expect the output to be in that format.
My planned approach was:
a) Read one Input file first
b) Group by [variant, chr position] and compute metrics such as n_het, n_hom_ref and n_hom_alt (sample IDs are not needed, since we group by variant and chr position). Should I just use the hl.variant_qc method here?
c) Store the output in a MatrixTable (mt)
d) Repeat steps a and b for file 2
e) Combine the output of step d with that of step c.
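The combining step (e) can be pictured with a small plain-Python model. This is only a sketch of the idea, not Hail code: the two dictionaries stand in for the two input files, and the variant keys, genotype strings and helper names (`per_variant_counts`, `combine`) are made up for illustration.

```python
from collections import Counter

# Hypothetical toy data: each "file" maps a variant key (chrom, pos, ref, alt)
# to the genotype calls of that file's samples.
file_1 = {
    ("chr1", 1000, "A", "G"): ["0/0", "0/1", "1/1"],
    ("chr1", 2000, "C", "T"): ["0/1", "0/1", "0/0"],
}
file_2 = {
    ("chr1", 1000, "A", "G"): ["0/1", "0/0"],
    ("chr2", 500, "G", "A"): ["1/1", "0/0"],
}

# Map each genotype string to the metric it contributes to.
LABELS = {"0/0": "n_hom_ref", "0/1": "n_het", "1/1": "n_hom_alt"}


def per_variant_counts(calls_by_variant):
    """Count genotype categories per variant; sample IDs are not needed."""
    return {
        variant: Counter(LABELS[gt] for gt in calls)
        for variant, calls in calls_by_variant.items()
    }


def combine(counts_a, counts_b):
    """Sum the per-variant counts of two files, keyed by variant."""
    return {
        variant: counts_a.get(variant, Counter()) + counts_b.get(variant, Counter())
        for variant in counts_a.keys() | counts_b.keys()
    }


combined = combine(per_variant_counts(file_1), per_variant_counts(file_2))
for variant in sorted(combined):
    print(variant, dict(combined[variant]))
```

Note that in this model a variant missing from one file simply contributes nothing from that file; whether its samples should instead be counted as homozygous reference is a separate decision that the sketch does not make.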
I have a few questions:
a) Is it possible to achieve my expected output using Hail? How can I do this?
b) How can I compute those measures by grouping on variant and chr position? Which aggregate function should I use to get them? I understand I can follow this link to group by rows (variant and chr position)?
c) I tried the hl.variant_qc method on my VCF file_1; the count command then gave the output below:
(2954429, 1058)
But I got the same count right after reading the VCF file:
(2954429, 1058)
d) Why isn’t there any reduction in the size of the matrix table? I was expecting the variant_qc output stored in a mt to be smaller than the input mt. Wouldn’t variant_qc generate statistics per variant and chromosome position? I didn’t see the sample IDs in the variant_qc output, so my understanding was that it is grouped by variant and chromosome position. How does that work?
You shouldn’t need to use grouping methods, as VCFs don’t look the way you described. A matrix table already has one row per locus/allele pair. Each sample gets its own column.
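count just tells you the dimensions of a MatrixTable: the number of rows (variants) and the number of columns (samples). variant_qc won’t change this; it only adds a row field holding the per-variant statistics, and the samples are still there as columns.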
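The point about the unchanged count can be illustrated with a toy model in plain Python. This is a sketch of the idea only, not Hail itself: `variants`, `samples` and `entries` are hypothetical stand-ins for the row, column and entry data of a matrix table, and `annotate_n_het` is a made-up helper that mimics adding one per-row statistic.

```python
# Toy stand-in for a MatrixTable: one row per variant, one column per sample.
variants = [
    {"locus": "chr1:1000", "alleles": ["A", "G"]},
    {"locus": "chr1:2000", "alleles": ["C", "T"]},
]
samples = ["S1", "S2", "S3"]
entries = [
    ["0/0", "0/1", "1/1"],  # calls for the first variant, one per sample
    ["0/1", "0/1", "0/0"],  # calls for the second variant
]


def count(rows, cols):
    """Return (number of rows, number of columns), like a matrix table count."""
    return (len(rows), len(cols))


def annotate_n_het(rows, calls):
    """Attach a per-row statistic; no rows or columns are added or removed."""
    return [
        dict(row, n_het=sum(gt == "0/1" for gt in row_calls))
        for row, row_calls in zip(rows, calls)
    ]


before = count(variants, samples)
annotated = annotate_n_het(variants, entries)
after = count(annotated, samples)
print(before, after)
```

The statistic is computed across the samples of each row and stored on that row, so each row gains a field while the row and column counts stay as they were.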
However, because of a mistake in my code, it produced incorrect output like the one below, which isn’t what I expected. I also tried using agg.counter instead of agg.count_where, but the output was still incorrect.
I just found out that it can be done using annotate_rows, so I wrote the code below to get the output shown above. If it’s incorrect, or if there is a better way to write this, please let me know. I am posting it here for the benefit of others.
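For readers comparing the two aggregators mentioned earlier, their result shapes differ: hl.agg.counter yields a dictionary from each distinct value to its count, while hl.agg.count_where yields a single integer. The plain-Python sketch below models only that difference; it is not the code from this post, and the `counter` and `count_where` helpers here are illustrative stand-ins rather than the Hail functions.

```python
from collections import Counter

# Hypothetical genotype calls for one variant across five samples.
calls = ["0/0", "0/1", "0/1", "1/1", "0/0"]


def counter(values):
    """Model of a counter aggregation: distinct value -> number of occurrences."""
    return dict(Counter(values))


def count_where(predicate, values):
    """Model of a count-where aggregation: how many values satisfy the predicate."""
    return sum(1 for value in values if predicate(value))


by_value = counter(calls)
n_het = count_where(lambda gt: gt == "0/1", calls)
n_hom_ref = count_where(lambda gt: gt == "0/0", calls)
n_hom_alt = count_where(lambda gt: gt == "1/1", calls)

print(by_value)
print(n_het, n_hom_ref, n_hom_alt)
```

With one count-where per metric, each metric becomes its own integer field, which matches the column layout of the expected output table more directly than a single dictionary does.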