I am trying to do something like the following. This is what my mental model looks like.
The input files also have the other usual VCF columns, such as FORMAT fields; they are not shown here. The numbers in the variant_qc metrics of the right-hand output table are dummy values, but I expect the output to be in that format.
My planned approach is as follows:
a) Read one input file first
b) Group by [variant, chr position] and compute metrics such as n_hom_alt (sample IDs are not needed, since we group by variant and chr position). Should I just use the hl.variant_qc method here?
c) Store the output in a
d) Repeat steps a and b for
e) Combine the output of step d with the output of step c.
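To make the steps above concrete, here is a minimal sketch in plain Python (not Hail) of the per-variant computation I have in mind. The record layout, the toy data, the helper name `per_variant_metrics`, and every metric name other than n_hom_alt are assumptions made up for illustration. I am also assuming that the repeated steps apply to a second input file and that "combine" means stacking the per-variant rows.

```python
from collections import defaultdict

# Hypothetical toy input: one tuple per (variant, sample) genotype call.
# Assumed layout: (chrom, pos, variant, sample_id, n_alt_alleles), where
# n_alt_alleles is 0 (hom ref), 1 (het), 2 (hom alt) or None (no call).
file_1 = [
    ("1", 1000, "A/T", "s1", 0),
    ("1", 1000, "A/T", "s2", 1),
    ("1", 1000, "A/T", "s3", 2),
    ("1", 2000, "G/C", "s1", 2),
    ("1", 2000, "G/C", "s2", 2),
    ("1", 2000, "G/C", "s3", None),
]
file_2 = [
    ("2", 500, "C/T", "s4", 1),
    ("2", 500, "C/T", "s5", 0),
]


def per_variant_metrics(records):
    """Steps a-b: group by (chrom, pos, variant) and count genotype classes."""
    metrics = defaultdict(
        lambda: {"n_called": 0, "n_not_called": 0,
                 "n_hom_ref": 0, "n_het": 0, "n_hom_alt": 0}
    )
    labels = {0: "n_hom_ref", 1: "n_het", 2: "n_hom_alt"}
    for chrom, pos, variant, _sample_id, n_alt in records:
        row = metrics[(chrom, pos, variant)]  # sample ID is not part of the key
        if n_alt is None:
            row["n_not_called"] += 1
        else:
            row["n_called"] += 1
            row[labels[n_alt]] += 1
    return dict(metrics)


# Steps c-e: compute the metrics per file, then stack the rows into one table.
combined = []
for source, records in (("file_1", file_1), ("file_2", file_2)):
    for key, counts in per_variant_metrics(records).items():
        combined.append({"source": source, "variant_key": key, **counts})

for row in combined:
    print(row)
```

In this sketch the result has one row per variant and position, with no sample dimension left, which is the shape I expect for the output table.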
I have a few questions:
a) Is it possible to achieve my expected output using Hail? How can I do this?
b) How can I compute those measures when grouping by variant and chr position? Which aggregate function should I use to get them? I understand I can follow this link to group by rows (variant and chr position)?
c) I tried the hl.variant_qc method on my VCF file_1, which gave me the below output for the count command.
But I got the same count when I first read the VCF file.
d) Why isn’t there any reduction in the size of the matrix table? I was expecting the variant_qc output stored in an mt to be smaller than the input mt. Wouldn’t variant_qc generate statistics for each variant and chromosome position? I didn’t see the sample IDs in the variant_qc output, so I understood it to be grouped by variant and chromosome position. How does that work?