I am trying to do something like below. This is how my mental model looks like.
The input files have other usual VCF columns such as
FORMAT fields as well. It’s not shown here. The numbers shown in right hand side output table
variant_qc metrics are dummy. But I expect the output to be in that format.
I had the below approach to solve this
a) Read one Input file first
b) Group by [variant, chr position] and compute the metrics such as
n_hom_alt (no need of sample ids as we group by variant and chr position). Should I just use
hl.variant_qc method here?
c) Store the output in a
d) Repeat steps a and b for
e) combine the output of d with step c.
I have few questions
a) Is it possible to achieve my expected output using
hail? How can I do this?
b) How can I compute those measures by grouping variants and chr position? What aggregate function should I use to get those measures? I understand I can follow this link to use group by rows (variant and chr position)?
c) I tried
hl.variant_qc method on my VCF file_1 which gave me the below output for count command
But I also found the same count initially when I read the VCF file
d) Why isn’t there any reduction in size of the matrix table? I was expecting
variant_qc output stored in a
mt to be less in size when compared to the input
mt. Wouldn’t variant_qc generate statistics based on each variant and chromosome position? I didn’t see the sample Ids in the variant_qc, so I was of the understanding that it is grouped by variant and chromosome position. May I know how does that work?
If these VCFs are both for the same samples, but just represent different chromosomes or something, you can do:
path_1 = "path/to/file1.vcf"
path_2 = "path/to/file2.vcf"
mt = hl.import_vcf([path_1, path_2])
and it’ll import both at once.
You shouldn’t need to use grouping methods, as VCFs don’t look the way you described. A matrix table already has one row per locus/allele pair. Each sample gets its own column.
count just tells you the number of rows in a
variant_qc won’t change this.
@johnc1231 - So, importing multiple files is like combining all VCF files together?
Yeah. You should read this if you haven’t: https://samtools.github.io/hts-specs/VCFv4.2.pdf
VCF’s don’t look the way you drew them in your Excel table. They have all the samples in one row.
Actually what I am trying to achieve is like as shown below
I would like to group the records by
locus and compute measures such as
n_hom_ref etc. So, I tried the below
result_mt = (mt.group_rows_by(mt.locus,mt.alleles)
n_het = hl.agg.count_where(mt['GT'].is_het())))
However, it produced an incorrect output (due to my code) like below (which isn’t expected). I also tried using
agg.counter instead of
count.where() but still incorrect output.
I expect my output to be like this (at the population level). I don’t want sample level data.
The above metric numbers shown in table are just dummy and doesn’t carry any meaning
May I know whats the how can I transform this the right way?
I just found out it can be done using
annotate_rows. Hence wrote the below code to get the output shown above. If its incorrect or any other better way to write this, please do let me know. I am posting it here for the benefit of others
MT2 = mt.annotate_rows(n_hom_ref=hl.agg.count_where(mt.GT.is_hom_ref()),
n_het = hl.agg.count_where(mt.GT.is_het()),
n_hom_alt = hl.agg.count_where(mt.GT.is_hom_var()))