I have two Hail MatrixTables: srWGS (41 million SNPs) and lrWGS (6 million SNPs). The lrWGS SNPs are a subset of the srWGS SNPs (same variant sites, but with slightly different GT calls). The sample IDs are the same in both.
I am looking to generate a PCA of the pruned srWGS SNPs, then generate a PCA of the pruned lrWGS SNPs and project the new points onto the existing srWGS PCA. Essentially, this would show how “different” or impactful the missing SNPs are in the lrWGS dataset compared to the srWGS dataset. Is this doable with Hail’s built-in functions?
If I merge the two tables together (rename lrWGS sample IDs), then generate a single PCA, would this yield the same result as above?
Thank you.
Hi @soisa001,
Normally we advise performing PCA using only common variants, which I believe is standard practice (I’m not a scientist). But it seems you want to verify that practice?
Hail’s PCA method returns all the information you should need to compute some measure of “difference” of the two PCAs, but how to do that would depend on what measure you want to use. Did you have an idea how you want to do that?
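To make “some measure of difference” concrete, here is one possible choice, offered only as a sketch: a Procrustes comparison of two per-sample score matrices (for example, the scores from each PCA collected into NumPy arrays with rows in the same sample order). PC axes have arbitrary sign and can be rotated relative to each other, so the two sets of scores need to be aligned before distances mean anything. The function name `procrustes_disparity` is made up for this example and is not part of Hail.

```python
import numpy as np

def procrustes_disparity(scores_a, scores_b):
    """Align scores_b to scores_a and measure what is left over.

    Both inputs are (n_samples x k) arrays whose rows refer to the same
    samples in the same order. Returns (disparity, per_sample_distance),
    where disparity is 0 for identical shapes and approaches 1 for
    unrelated ones.
    """
    # Remove translation and overall scale from both configurations.
    a = scores_a - scores_a.mean(axis=0)
    b = scores_b - scores_b.mean(axis=0)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)

    # Orthogonal Procrustes: best rotation/reflection of b onto a.
    u, s, vt = np.linalg.svd(b.T @ a)
    b_aligned = (b @ (u @ vt)) * s.sum()  # s.sum() is the optimal scale

    per_sample = np.linalg.norm(a - b_aligned, axis=1)
    return float(np.sum(per_sample ** 2)), per_sample
```

The per-sample distances show which samples move most between the two PCAs; the single disparity number summarizes the overall change. Other measures are equally defensible, so treat this as one option rather than a recommendation.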
If I merge the two tables together (rename lrWGS sample IDs), then generate a single PCA, would this yield the same result as above?
This would definitely give a different result from running PCA on either dataset alone. I think this is probably not what you want to do.
Also be aware that PCA with 41 million SNPs would be an absolutely massive computation. Hail has a scalable PCA method which should be able to do it, but I’m not sure whether a PCA of that scale has been attempted before, or how expensive it would be.
Hi Patrick,
Thanks for the quick response. I was looking to do something similar to https://github.com/DReichLab/EIG/blob/master/POPGEN/lsqproject.pdf
where you could consider the srWGS as the complete dataset, and the lrWGS as something like an archaic dataset (lrWGS is low coverage and has a subset of the srWGS variants).
First I would generate the srWGS PCA and then “project” the lrWGS samples onto that.
Would this be possible in Hail?
I would probably downsample variants as well before creating the PCA, so that it’s not so large.
Thanks
Thanks for the reference, I think I understand now. To make sure: you would downsample to common variants, but even then, some of those variants would be missing (at least in some samples) in lrWGS, and you want to project the lrWGS samples into the PC space of srWGS, using the method in the pdf you linked rather than mean imputing the missing variants. Is that right?
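To pin down the distinction, here is a small NumPy sketch of the two projection strategies on a plain samples-by-variants matrix. It is an illustration in the spirit of the linked note, not Hail code and not EIGENSOFT’s implementation: the function names are made up, missing genotypes are encoded as NaN, and the data are only mean-centered (no allele-frequency normalization), all of which are assumptions made for this example.

```python
import numpy as np

def fit_reference_pca(X_ref, k):
    """PCA of a complete (samples x variants) reference matrix.

    Returns per-variant means, loadings V (variants x k), and the
    reference scores (samples x k).
    """
    mean = X_ref.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_ref - mean, full_matrices=False)
    V = Vt[:k].T
    return mean, V, U[:, :k] * S[:k]

def project_mean_imputed(x, mean, V):
    """Set missing entries (NaN) to the variant mean, then project."""
    centered = np.where(np.isnan(x), 0.0, x - mean)
    return centered @ V

def project_least_squares(x, mean, V):
    """Solve for the scores using only the variants observed in x."""
    obs = ~np.isnan(x)
    scores, *_ = np.linalg.lstsq(V[obs], x[obs] - mean[obs], rcond=None)
    return scores

# Toy usage: hide half of one reference sample's genotypes and project it.
rng = np.random.default_rng(0)
X_ref = rng.integers(0, 3, size=(50, 200)).astype(float)
mean, V, ref_scores = fit_reference_pca(X_ref, k=2)
x = X_ref[0].copy()
x[rng.random(200) < 0.5] = np.nan
print("mean-imputed: ", project_mean_imputed(x, mean, V))
print("least squares:", project_least_squares(x, mean, V))
print("complete data:", ref_scores[0])
```

With mean imputation, every missing variant contributes zero to the projection, so samples with a lot of missing data get pulled toward the origin. The least-squares version instead fits the scores to whatever variants are present, which is why it is the better fit for a sparse sample set projected onto PCs computed from complete data.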
Yes, that’s the general gist of it. We are also interested in computing the srWGS space with the complete variant set (i.e., not downsampling), then projecting the lrWGS onto it.
It sounds like the current Hail method imputes missing information?