PCA to output allele frequencies alongside loadings?


Title says most of it. I feel like hwe_normalized_pca could output the frequency it uses to center alongside the loadings. I’m doing this for downstream projection, which presumably is most of the use cases for outputting the loadings in the first place.

Right now my process looks like:

eig, scores, loadings = hl.hwe_normalized_pca(mt)
mt = mt.annotate(af=hl.agg.call_stats(mt.GT, mt.alleles).AF[1])
loadings = loadings.annotate(af=mt[loadings.locus])

But this seems silly and wasteful.


Since you’re likely filtering beforehand based on an AF annotation on the MatrixTable, I think it’s reasonable to use the generic PCA functionality with the HWE normalizing expression, so that the AF need not be recalculated, and then to annotate AF on the loadings from the rows of the MatrixTable. You’d modify this expression from the Python code:

    entry_expr = hl.bind(
        dataset.AC / dataset.n_called,
        lambda mean_gt: hl.cond(hl.is_defined(dataset.GT),
                                (dataset.GT.n_alt_alleles() - mean_gt) /
                                hl.sqrt(mean_gt * (2 - mean_gt) * n_variants / 2),

I’m less sure how the latter join will perform.


Or if this is really a common use case after running hwe_normalized_pca, we could consider adding AF to loadings as well, but only for that specialized PCA function.