I’m trying to analyze UK Biobank data with Hail, and I’m getting output from hl.split_multi_hts(mt) that I don’t understand. Looking more carefully at the entries for the variant depicted below, I noticed that the mt.variant_qc.AC values look inconsistent.
For the G:A allele, the allele counts are [939595, 5], and for the G:T allele, they are [817215, 122385] (see below for snapshot of the mt.rows().show() output). Shouldn’t the allele count for G be the same in the two rows, or am I missing something?
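Those counts look like what I’d expect if you ran variant_qc after running split_multi_hts. An A/T genotype, after splitting, becomes two genotypes (one in each split row): G/A and G/T. So in the G:A row the T allele is counted as reference, and in the G:T row the A allele is counted as reference.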
These AC arrays both sum to the same number (939600), which makes sense to me: the total number of alleles at an autosomal site should be 2 × N_samples.
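To make the counting concrete, here is a small pure-Python sketch of the effect described above. It does not call Hail; the toy genotypes and the helper names (downcode, allele_counts) are made up for illustration, and the downcoding rule is a simplified stand-in for what splitting does to GT.

```python
from collections import Counter

# Toy multiallelic site with alleles G (index 0), A (index 1), T (index 2).
# Each genotype is a pair of allele indices; these five samples are invented.
genotypes = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]


def downcode(genotype, a_index):
    """Keep the allele at a_index as 1 and collapse every other allele to 0 (ref)."""
    return tuple(sorted(1 if allele == a_index else 0 for allele in genotype))


def allele_counts(gts, n_alleles):
    """Count how often each allele index appears across all genotypes."""
    counts = Counter(allele for gt in gts for allele in gt)
    return [counts[i] for i in range(n_alleles)]


ac_unsplit = allele_counts(genotypes, 3)                             # counts for [G, A, T]
ac_row_a = allele_counts([downcode(gt, 1) for gt in genotypes], 2)   # G:A row
ac_row_t = allele_counts([downcode(gt, 2) for gt in genotypes], 2)   # G:T row

print(ac_unsplit, ac_row_a, ac_row_t)
```

In this toy data the "reference" count in the G:A row is the G count plus the T count, and in the G:T row it is the G count plus the A count, while both rows still sum to twice the number of samples.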
If you ran variant_qc before split_multi, then you could construct ACs that look the way you want:
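A minimal sketch of that approach, assuming mt is the unsplit MatrixTable. The function name and the AC_unsplit field name are hypothetical; the sketch relies on split_multi_hts adding an a_index row field and keeping the existing variant_qc row field, so check it against your Hail version. The small pick_ref_and_alt helper just shows the same indexing on a plain list.

```python
def pick_ref_and_alt(multi_ac, a_index):
    """Plain-Python version of the indexing: [true ref count, count of this alt]."""
    return [multi_ac[0], multi_ac[a_index]]


def annotate_unsplit_ac(mt):
    """Compute AC on the multiallelic rows, split, then pick out ref and this row's alt."""
    import hail as hl  # imported here so the helper above works without Hail installed

    mt = hl.variant_qc(mt)        # AC has one entry per allele: [ref, alt1, alt2, ...]
    mt = hl.split_multi_hts(mt)   # adds a_index: position of this row's alt in the original alleles
    # AC_unsplit is a hypothetical field name chosen for this example.
    return mt.annotate_rows(
        AC_unsplit=[mt.variant_qc.AC[0], mt.variant_qc.AC[mt.a_index]]
    )
```

With this annotation the first element is the count of true G alleles in every split row, so the two rows agree on the reference count, but the pair no longer sums to 2 × N_samples at a multiallelic site.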
Thank you, Dan! I think I got confused because after splitting, for example in the G:A row, the number of reference (G) alleles is actually the sum of the G alleles and the other alternate (T) alleles. That way the total number of alleles (AN) and the allele frequency (AF) for the A allele are preserved.
The way you suggest to annotate the rows makes sense!