Hl.experimental.pc_project and the alleles key

I’ve questions about what the pc_project function does.
It seems that it expects both its first and second argument, a matrix table mt that is being projected and a hail table pc_loadings_ht, to be keyed by [‘locus’, ‘alleles’].
To what extend is the pc_project function flexible when it comes to pojecting a call from mt that belongs to a variant row_0 in a situation when the key of row_0 only partly matches a key from pc_loadings_ht?
For example, say a have a homozygous 0/0 call in mt, in a variant whose key is [chr1:1000, [A]], and there is a variant in pc_loadings_ht whose key is [chr1:1000, [T, A]]. Is the call matched with this variant and interpreted as 1/1? What if it the variant in mt were keyed by a) [chr1:1000, [T, A, *], b) [chr1:1000, [A, T], c) [chr1:1000, [T, *] ?
How is missingness of variants from pc_loadings_ht in mt interpreted? As those variants having been called as 0/0?

I hope those question at least make sense. If not, let me know :slight_smile:
Any explanations will be appreciated!

  1. I think the full key (locus and alleles) is used here. However, if you want to join only by locus, you can accomplish that by keying both the matrix table and the loadings table by ‘locus’ to drop the alleles key.

  2. Missingness is treated as being a centered 0, meaning that those loadings won’t contribute to the overall score. The sum of (normed_GT * loading) is divided by the square of the number of variants for each same to compute the PC score, so if some variants are missing, the score will be shrunk towards zero.

Thank You! If yet may dwell on this topic:

I’m actually specifically interested in projecting my samples (along with samples from 1kg genomes) on gs://gnomad-public/release/2.1/pca/gnomad.r2.1.pca_loadings.ht.

  1. The thing is, the above loadings are given for specific biallelic variants keyed by [locus, alleles] (which makes sense of course). So, if I understand correctly, it is entirely up to me to worry about translating genomes in my matrix table into those same variants, in case my ‘alleles’ are different in a given position.

a) Would it then be accurate to say that a missing genotype (in the matrix that is being projected) is equivalent to an average genotype in that position from the population that was used in constructing the loadings?
b) What is normed_GT?
c) When You say: “The sum […] is divided by the square of the number of variants for each sample …”, You mean the overall number of variants in the loadings table, regardless of missingness in my samples?

  1. I think this is spot on, yes. If your alleles are flipped, for instance, joining by locus would be incorrect – the alleles (and calls) should be flipped.

  2. .
    a. Yeah, that’s right, to my understanding.
    b. here’s the computation. We need to mean-center the genotype (the centered thing is normed_gt) before projecting it onto the loadings. I think the correct thing here is probably to use the allele frequency in the dataset used to compute loadings, or the projection might be slightly off.
    c. Yeah, that’s right. See the divisor on line 61 of the linked code above – n_variants is the number of variants in the loadings table, regardless of how many matched variants are found in your matrix table.

1 Like

Many thanks!

As a complete sidenote: I’m a bit buffled by (1 - mt._af) standing in the norm on line 61. Is there maybe some short handwavy explanation for it?

This is the hardy-weinberg variance term, we’re dividing by 2 * p * q there, I believe.

cc @konradjk to confirm?

1 Like