Hello!
I have some questions about what the `pc_project` function does.
It seems that it expects both its first and second arguments, a matrix table `mt` that is being projected and a Hail table `pc_loadings_ht`, to be keyed by `['locus', 'alleles']`.
1. To what extent is the `pc_project` function flexible when it comes to projecting a call from `mt` that belongs to a variant `row_0`, in a situation where the key of `row_0` only partly matches a key from `pc_loadings_ht`?
For example, say I have a homozygous 0/0 call in `mt`, in a variant whose key is `[chr1:1000, [A]]`, and there is a variant in `pc_loadings_ht` whose key is `[chr1:1000, [T, A]]`. Is the call matched with this variant and interpreted as 1/1? What if the variant in `mt` were keyed by a) `[chr1:1000, [T, A, *]]`, b) `[chr1:1000, [A, T]]`, c) `[chr1:1000, [T, *]]`?
2. How is missingness in `mt` of variants from `pc_loadings_ht` interpreted? As those variants having been called as 0/0?
I hope these questions at least make sense. If not, let me know.
Any explanations will be appreciated!
Thank you! If I may dwell on this topic a bit longer:
I’m actually specifically interested in projecting my samples (along with samples from 1000 Genomes) onto `gs://gnomad-public/release/2.1/pca/gnomad.r2.1.pca_loadings.ht`.
- The thing is, the above loadings are given for specific biallelic variants keyed by [locus, alleles] (which makes sense, of course). So, if I understand correctly, it is entirely up to me to worry about translating the genomes in my matrix table into those same variants, in case my ‘alleles’ are different at a given position.
- a) Would it then be accurate to say that a missing genotype (in the matrix that is being projected) is equivalent to the average genotype at that position in the population that was used to construct the loadings?
  b) What is `normed_GT`?
  c) When you say: “The sum […] is divided by the square of the number of variants for each sample …”, do you mean the overall number of variants in the loadings table, regardless of missingness in my samples?
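To make the first bullet concrete, here is a minimal plain-Python sketch (not Hail code) of what I mean by an exact-key match. It assumes the loadings are looked up by the full [locus, alleles] key and nothing looser, which is my understanding and not something confirmed in this thread. The dictionary, the loading value, and the `lookup_loading` helper are all made up for illustration.

```python
from typing import Optional, Tuple

# Hypothetical loadings table: exact (locus, alleles) key -> loading for one PC.
# The value 0.12 is invented purely for illustration.
loadings = {
    ("chr1:1000", ("T", "A")): 0.12,
}


def lookup_loading(locus: str, alleles: Tuple[str, ...]) -> Optional[float]:
    """Return a loading only when the whole key matches exactly (assumed behavior)."""
    return loadings.get((locus, alleles))


# Allele lists from question 1, plus the one key that is actually in the table.
candidates = [("A",), ("T", "A", "*"), ("A", "T"), ("T", "*"), ("T", "A")]
matches = {alleles: lookup_loading("chr1:1000", alleles) for alleles in candidates}

for alleles, loading in matches.items():
    print(alleles, "->", loading)
```

Under that assumption, only the row keyed by exactly `("T", "A")` picks up a loading. Every other row would have to be rewritten to that key (for example by splitting multi-allelics or swapping ref and alt along with the genotypes) before projecting.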
Many thanks!
As a complete side note: I’m a bit baffled by `(1 - mt._af)` appearing in the norm on line 61. Is there maybe some short, hand-wavy explanation for it?
This is the Hardy-Weinberg variance term; we’re dividing by `2 * p * q` there, I believe.
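As a rough illustration of where that term comes from, here is a plain-Python sketch (not the actual Hail code; `normalized_genotype` is a made-up name). Under Hardy-Weinberg the alt-allele count behaves like a Binomial(2, p) draw, so its mean is 2p and its variance is 2p(1 - p), which is where `(1 - af)` enters. The usual standardization divides by the square root of that variance. Whether line 61 does exactly this, and whether a missing call really contributes 0, are assumptions of the sketch.

```python
import math
from typing import Optional


def normalized_genotype(n_alt: Optional[int], af: float) -> float:
    """Center by the expected alt-allele count 2p and scale by sqrt(2pq).

    A missing call (None) is mapped to 0.0, i.e. treated as the mean genotype.
    That treatment is an assumption of this sketch.
    """
    if not 0.0 < af < 1.0:
        raise ValueError("af must be strictly between 0 and 1")
    if n_alt is None:
        return 0.0
    variance = 2.0 * af * (1.0 - af)  # Hardy-Weinberg variance term, 2 * p * q
    return (n_alt - 2.0 * af) / math.sqrt(variance)


p = 0.25  # illustrative alt-allele frequency
q = 1.0 - p

# Hardy-Weinberg genotype probabilities for 0, 1 and 2 alt alleles.
probs = {0: q * q, 1: 2.0 * p * q, 2: p * p}

# Mean and variance of the normalized genotype under those probabilities.
hwe_mean = sum(prob * normalized_genotype(g, p) for g, prob in probs.items())
hwe_var = sum(prob * normalized_genotype(g, p) ** 2 for g, prob in probs.items())

print("mean:", hwe_mean, "variance:", hwe_var)
```

With this scaling, each variant has mean 0 and variance 1 under Hardy-Weinberg, so rare and common variants contribute on the same footing.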
cc @konradjk to confirm?