Error in hl.split_multi_hts() output on UK Biobank data?

Hi all—

I’m trying to analyze UK Biobank data with Hail, and I’m getting output from hl.split_multi_hts(mt) that I can’t make sense of. Looking more carefully at the entries for the variant depicted below, I noticed that the mt.variant_qc.AC values in its two rows don’t seem to agree with each other.

For the G:A allele, the allele counts are [939595, 5], and for the G:T allele, they are [817215, 122385] (see below for snapshot of the mt.rows().show() output). Shouldn’t the allele count for G be the same in the two rows, or am I missing something?

Best,
Jeremy

Those counts look like what I’d expect if you ran variant_qc after running split_multi_hts. Each split row recodes every allele other than its own alternate as reference, so an A/T genotype, after splitting, becomes two genotypes (one in each row): G/A and G/T.

These AC arrays both sum to the same number (939,600), which makes sense to me: the total number of alleles at an autosomal site should be 2N_{samples}.
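To make the recoding concrete, here is a minimal pure-Python sketch of the effect, not Hail's actual implementation. The five toy genotypes, the allele numbering (0 = G, 1 = A, 2 = T), and the helper names `downcode` and `allele_counts` are all made up for illustration.

```python
from collections import Counter

def downcode(genotype, a_index):
    """Keep allele `a_index` as 1 and recode every other allele as 0 (ref)."""
    return tuple(sorted(1 if allele == a_index else 0 for allele in genotype))

def allele_counts(genotypes):
    """Return [ref count, alt count] for a list of biallelic genotypes."""
    counts = Counter(allele for gt in genotypes for allele in gt)
    return [counts[0], counts[1]]

# Hypothetical toy site: 0 = G (ref), 1 = A, 2 = T; one tuple per sample.
genotypes = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]

# Counting after the split, one biallelic row per alternate allele.
ac_g_a = allele_counts([downcode(gt, 1) for gt in genotypes])
ac_g_t = allele_counts([downcode(gt, 2) for gt in genotypes])

print(ac_g_a)  # "ref" here is G plus T
print(ac_g_t)  # "ref" here is G plus A
print(sum(ac_g_a), sum(ac_g_t), 2 * len(genotypes))
```

In this toy data the true G count is 4, but neither split row reports 4 as its reference count, because each row folds the other alternate allele into reference. Both rows still sum to twice the number of samples.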

If you ran variant_qc before split_multi, then you could construct ACs that look the way you want:

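# ... variant_qc (run on the unsplit, multiallelic rows)
mt = hl.variant_qc(mt)
# ... split_multi_hts (adds the a_index row field)
mt = hl.split_multi_hts(mt)
# keep the reference count plus the count for this row's own alternate allele
mt = mt.annotate_rows(AC=[mt.variant_qc.AC[0], mt.variant_qc.AC[mt.a_index]])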

Thank you, Dan! I think I got confused because after splitting, in the G:A row for example, the reference (G) count is actually the number of G alleles plus the number of the other alternate (T) alleles. That way the row keeps the same total number of alleles (AN) and the correct allele frequency (AF) for the A allele.
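The arithmetic can be checked from the two AC arrays quoted in the question. This is only a sketch: it assumes A and T are the only alternate alleles at the site and that A has a_index 1 and T has a_index 2, neither of which is confirmed in the thread.

```python
# AC arrays reported after split_multi_hts followed by variant_qc.
ac_g_a = [939595, 5]        # G:A row
ac_g_t = [817215, 122385]   # G:T row

# Both rows should share the same total allele number (AN).
an = sum(ac_g_a)
assert an == sum(ac_g_t)

n_a = ac_g_a[1]
n_t = ac_g_t[1]
# Assumption: no other alternate alleles, so G is whatever remains.
n_g = an - n_a - n_t

# AC as variant_qc would report it on the unsplit row (assumed allele order G, A, T).
multi_ac = [n_g, n_a, n_t]

# The [AC[0], AC[a_index]] selection, applied to each split row.
per_row_ac = [[multi_ac[0], multi_ac[a_index]] for a_index in (1, 2)]

print(n_g)
print(per_row_ac)
```

Under those assumptions the G count comes out the same in both rows, which is the behaviour I was originally expecting.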

The way you suggest to annotate the rows makes sense!

Best,
Jeremy
