PCA Projection onto existing PCA

soisa001 · September 13, 2023, 8:38pm

I have a srWGS (41 million SNPs) and lrWGS (6 million SNPs) Hail MT. The lrWGS MT SNPs are a subset of the srWGS SNPs (same variant sites, but with slightly different GT calls). The sample IDs are the same in both.

I am looking to generate a PCA of the pruned srWGS SNPs, and then generate a PCA of the pruned lrWGS SNPs, and project the new points onto the existing srWGS dataset. Essentially, this would show how “different” or impactful the missing SNPs are in the lrWGS dataset as compared to the srWGS dataset. Is this doable with Hail’s built-in functions?

If I merge the two tables together (rename lrWGS sample IDs), then generate a single PCA, would this yield the same result as above?

Thank you.

patrick-schultz · September 21, 2023, 1:19pm

Hi @soisa001,
Normally we advise performing PCA using only common variants, which I believe is standard practice (I’m not a scientist). But it seems you want to verify that practice?

Hail’s PCA method returns all the information you should need to compute some measure of “difference” of the two PCAs, but how to do that would depend on what measure you want to use. Did you have an idea how you want to do that?
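One possible measure, sketched here with plain NumPy rather than Hail, is a Procrustes disparity between the two per-sample score matrices. It is insensitive to sign flips and rotations of the PCs, which can differ arbitrarily between two PCA runs. The function name `procrustes_disparity` and the random demo data are illustrative assumptions, not part of Hail; with real data the inputs would be the PC scores for the same samples, in the same row order, from each PCA.

```python
import numpy as np

def procrustes_disparity(a, b):
    """Residual (0 = identical shape, 1 = unrelated) after optimally
    rotating and scaling b onto a. Rows are samples, columns are PCs."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    a = a / np.linalg.norm(a)  # unit Frobenius norm
    b = b / np.linalg.norm(b)
    singular_values = np.linalg.svd(a.T @ b, compute_uv=False)
    # For unit-norm inputs the optimal fit leaves 1 - (sum of singular values)^2
    return 1.0 - singular_values.sum() ** 2

# Hypothetical demo data standing in for two sets of PC scores
rng = np.random.default_rng(0)
scores_a = rng.normal(size=(50, 3))
scores_b = scores_a @ np.diag([1.0, -1.0, 1.0])  # same points, PC2 sign-flipped
scores_c = rng.normal(size=(50, 3))              # unrelated points

print(procrustes_disparity(scores_a, scores_b))  # close to 0
print(procrustes_disparity(scores_a, scores_c))  # close to 1
```

Whether this is the right notion of “difference” depends on the question being asked; it is only one option.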

> If I merge the two tables together (rename lrWGS sample IDs), then generate a single PCA, would this yield the same result as above?

This would definitely be different from a PCA of either dataset alone. I think this is probably not what you want to do.

patrick-schultz · September 21, 2023, 1:28pm

Also be aware that PCA with 41 million SNPs would be an absolutely massive computation. Hail has a scalable PCA method which should be able to do it, but I’m not sure if a PCA of that scale has been attempted before, or how expensive it would be.

soisa001 · September 21, 2023, 1:44pm

Hi Patrick,

Thanks for the quick response. I was looking to do something similar to https://github.com/DReichLab/EIG/blob/master/POPGEN/lsqproject.pdf
where you could consider the srWGS as the complete dataset, and the lrWGS as something like an archaic dataset (lrWGS is low coverage and has a subset of the srWGS variants).
First I would generate the srWGS PCA and then “project” the lrWGS samples onto that.

Would this be possible in Hail?
I would probably downsample variants as well before creating the PCA, so that it’s not so large.

Thanks

patrick-schultz · September 21, 2023, 3:07pm

Thanks for the reference, I think I understand now. To make sure: you would downsample to common variants, but even then, some of those variants would be missing (at least in some samples) in lrWGS, and you want to project the lrWGS samples into the PC space of srWGS, using the method in the pdf you linked rather than mean imputing the missing variants. Is that right?
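To make the distinction concrete, here is a minimal NumPy sketch of the two projection styles. It is written in the spirit of the linked note, not as a reimplementation of EIGENSOFT or of anything in Hail. All function names and the simulated genotypes are hypothetical. The reference PCA is fit on complete data; a new sample with missing SNPs is then placed either by zero-filling the missing normalized entries (mean imputation) or by solving a least-squares problem over the observed SNPs only.

```python
import numpy as np

def hwe_normalize(genotypes, allele_freq):
    """Center and scale 0/1/2 genotypes per SNP; NaN (missing) stays NaN."""
    return (genotypes - 2 * allele_freq) / np.sqrt(2 * allele_freq * (1 - allele_freq))

def fit_reference_pca(ref_genotypes, k):
    """PCA of a complete reference matrix (samples x SNPs) via SVD."""
    allele_freq = ref_genotypes.mean(axis=0) / 2
    x = hwe_normalize(ref_genotypes, allele_freq)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    loadings = vt[:k].T  # SNPs x k, orthonormal columns
    return allele_freq, loadings

def project_mean_imputed(x, loadings):
    """Orthogonal projection with missing entries set to 0 (the SNP mean after centering)."""
    return np.where(np.isnan(x), 0.0, x) @ loadings

def project_least_squares(x, loadings):
    """Fit PC coordinates using only the SNPs observed in this sample."""
    observed = ~np.isnan(x)
    coords, *_ = np.linalg.lstsq(loadings[observed], x[observed], rcond=None)
    return coords

# Simulated reference panel (assumption: stands in for the srWGS genotypes)
rng = np.random.default_rng(0)
ref = rng.binomial(2, 0.3, size=(100, 500)).astype(float)
ref = ref[:, (ref.mean(axis=0) > 0) & (ref.mean(axis=0) < 2)]  # drop monomorphic SNPs
allele_freq, loadings = fit_reference_pca(ref, k=4)

# Synthetic normalized sample that lies exactly in the PC space, then 40% masked
true_coords = np.array([3.0, -2.0, 1.0, 0.5])
sample = loadings @ true_coords
sample[rng.random(sample.size) < 0.4] = np.nan

ls_coords = project_least_squares(sample, loadings)
mi_coords = project_mean_imputed(sample, loadings)
print(ls_coords)  # recovers true_coords
print(mi_coords)  # shrunk toward the origin
```

In this toy setup the least-squares fit recovers the true coordinates, while the zero-filled projection is pulled toward the origin in proportion to the missingness. A real lrWGS sample would first be normalized with the srWGS allele frequencies (`hwe_normalize(genotypes, allele_freq)`) before either projection.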

soisa001 · September 22, 2023, 6:02pm

Yes, that’s the general gist of it. We are also interested in computing the srWGS space with the complete variant set (i.e., not downsampling), then projecting the lrWGS onto it.
It sounds like the current Hail method imputes missing information?
