PCA plot for data compared with ancestry reference (1000 Genomes)

Julie · October 15, 2019, 3:45pm

Is there an easy way to compare my own SNP dataset with the 1000 Genomes dataset in order to determine which ancestral population my samples belong to? Does it involve merging the 1000 Genomes dataset with my own dataset, or would it be done a different way in Hail?

danking · October 15, 2019, 6:29pm

Hi @Julie,

Let’s break this into separate steps.

If you have a variant-sample matrix (or any other dataset, such as RNA expression data), you can run principal component analysis (PCA), which factors your data into variant loadings, sample scores, and singular values. The variant loadings project your data into a low-dimensional space whose dimension is the number of principal components, a parameter you control. The sample scores are the locations of your samples in that same low-dimensional space.
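To make the three pieces concrete, here is a minimal numpy sketch of PCA via a singular value decomposition. It is not how Hail implements PCA; the function name pca_svd, the random toy matrix, and the choice to center each variant are assumptions made for illustration.

```python
import numpy as np

def pca_svd(genotypes, k):
    """PCA of a variants x samples matrix via a thin SVD (illustrative only)."""
    # Center each variant (row) so it has mean zero across samples.
    variant_means = genotypes.mean(axis=1, keepdims=True)
    centered = genotypes - variant_means
    # centered = u @ diag(s) @ vt, with singular values sorted descending.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    loadings = u[:, :k]                   # one row per variant, one column per PC
    singular_values = s[:k]
    scores = vt[:k].T * singular_values   # one row per sample, one column per PC
    return loadings, scores, singular_values, variant_means

# Toy data (assumption): 50 variants x 8 samples of alt-allele counts 0/1/2.
rng = np.random.default_rng(0)
toy = rng.integers(0, 3, size=(50, 8)).astype(float)
loadings, scores, singular_values, variant_means = pca_svd(toy, k=2)
```

Note that the scores equal the centered data multiplied by the loadings, which is why loadings alone are enough to place new samples in the same space.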

If you have variant loadings (whether from PCA or some other dimensionality reduction technique), you can “project” (or “score”) your samples. The projection sends your samples from genome-space to the lower dimensional space.
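As a rough sketch of what projection means, assuming the loadings and per-variant means come from a reference PCA and that the new samples are genotyped at the same variants in the same order, the projection is a single matrix product. The name project_samples and the tiny arrays are hypothetical.

```python
import numpy as np

def project_samples(genotypes, loadings, variant_means):
    """Project samples (columns of genotypes) onto existing variant loadings."""
    # Center with the reference means, not the means of the new samples.
    centered = genotypes - variant_means[:, None]
    # Result has one row per sample and one column per principal component.
    return centered.T @ loadings

# Hypothetical reference PCA output: 3 variants, 2 principal components.
loadings = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [0.0, 0.0]])
variant_means = np.array([1.0, 1.0, 1.0])

# Two new samples (columns) genotyped at the same 3 variants.
new_genotypes = np.array([[2.0, 0.0],
                          [1.0, 1.0],
                          [0.0, 2.0]])
new_scores = project_samples(new_genotypes, loadings, variant_means)
```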

If you have some points in low dimensional space, you can cluster those points into groups.

If you have clusters, you can assign points to clusters.
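The clustering and assignment steps can be sketched with a bare-bones k-means in numpy. This is only one possible clustering method, chosen here for brevity; the function names, the toy points, and the deterministic initialization from the first k points are all assumptions.

```python
import numpy as np

def assign(points, centroids):
    """Return the index of the nearest centroid for each point."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm; initial centroids are the first k points."""
    centroids = points[:k].copy()
    for _ in range(n_iter):
        labels = assign(points, centroids)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    return centroids, assign(points, centroids)

# Toy points in a 2-dimensional PC space: two well-separated groups.
points = np.array([[0, 0], [10, 10], [0, 1], [10, 11], [1, 0], [11, 10]], dtype=float)
centroids, labels = kmeans(points, k=2)

# Assign a new point to one of the existing clusters.
new_label = assign(np.array([[9.0, 9.0]]), centroids)[0]
```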

If you want to run PCA in Hail, you can use hl.pca. However, it’s important to prepare your data properly before running PCA. For genetic data, Hail provides hl.hwe_normalized_pca, which performs some of the necessary preparation for you.
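In Hail the call would look something like `eigenvalues, scores, loadings = hl.hwe_normalized_pca(mt.GT, k=10, compute_loadings=True)`, where mt is your MatrixTable. To show the kind of preparation involved, here is a numpy sketch of Hardy-Weinberg normalization as I understand it from the Hail documentation: each variant is centered at twice its alt-allele frequency and scaled by its expected standard deviation, with an extra factor for the number of variants. The function name hwe_normalize is hypothetical, and the sketch assumes there are no missing genotype calls.

```python
import numpy as np

def hwe_normalize(genotypes):
    """Normalize a variants x samples matrix of alt-allele counts (0/1/2)."""
    # Alt-allele frequency of each variant, estimated from the data.
    p = genotypes.mean(axis=1) / 2.0
    # Monomorphic variants have zero variance and must be dropped.
    keep = (p > 0) & (p < 1)
    g = genotypes[keep]
    p = p[keep][:, None]
    m = g.shape[0]  # number of variants that remain
    normalized = (g - 2 * p) / np.sqrt(m * 2 * p * (1 - p))
    return normalized, keep

# Toy data (assumption): 3 variants x 4 samples; the second variant is monomorphic.
genotypes = np.array([[0, 1, 2, 1],
                      [0, 0, 0, 0],
                      [2, 2, 1, 1]], dtype=float)
normalized, keep = hwe_normalize(genotypes)
```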

If you have variant loadings, you can project your dataset using the gnomAD project’s pc_project function.

Hail doesn’t have native functionality for clustering, but the projected data set is small (a handful of principal components per sample), so you can use existing Python clustering tools on it. You’ll want to export your data to a file and read it in with numpy or pandas.

Likewise, you can assign your samples to clusters in the low-dimensional space using Python libraries like numpy or pandas.
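Here is a minimal sketch of the export-then-assign step, assuming the scores were exported as a tab-separated file with one row per sample. The column names (s, pop, PC1, PC2), the sample IDs, and the values are all made up for illustration, and nearest-centroid assignment is just a simple stand-in for whatever clustering or classification tool you prefer.

```python
import csv
import io
import numpy as np

# Hypothetical export: reference samples carry a population label,
# your own samples have an empty label.
exported = (
    "s\tpop\tPC1\tPC2\n"
    "ref1\tAFR\t-1.0\t0.0\n"
    "ref2\tAFR\t-1.2\t0.2\n"
    "ref3\tEUR\t1.0\t1.0\n"
    "ref4\tEUR\t1.2\t0.8\n"
    "mine1\t\t-0.9\t0.1\n"
    "mine2\t\t0.9\t0.9\n"
)
# With a real file, replace io.StringIO(exported) with open(path, newline="").
rows = list(csv.DictReader(io.StringIO(exported), delimiter="\t"))
sample_ids = [r["s"] for r in rows]
pops = np.array([r["pop"] for r in rows])
pcs = np.array([[float(r["PC1"]), float(r["PC2"])] for r in rows])

# One centroid per labeled population, computed from the reference samples.
pop_names = sorted({str(p) for p in pops if p != ""})
centroids = np.array([pcs[pops == name].mean(axis=0) for name in pop_names])

# Assign each unlabeled sample to the population with the nearest centroid.
unlabeled = np.flatnonzero(pops == "")
dists = np.linalg.norm(pcs[unlabeled][:, None, :] - centroids[None, :, :], axis=2)
assigned = {sample_ids[i]: pop_names[j] for i, j in zip(unlabeled, dists.argmin(axis=1))}
```

Nearest-centroid assignment ignores the shape and spread of each population, so for real data a classifier trained on the labeled reference samples may be a better choice.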

I’ll let others comment on whether you can use existing 1KG PCA variant loadings or need to run PCA on a combined dataset including 1KG and your dataset.
