Is there an easy way to compare my own SNP dataset with the 1000 Genomes dataset in order to determine which ancestral population my samples belong to? Does it involve merging the 1000 Genomes dataset with my own dataset, or would it be done a different way in Hail?
Let’s break this into separate steps.
If you have a variant-sample matrix (or any other dataset, such as RNA expression data), you can run a principal components analysis, which factors your data into variant loadings, sample scores, and singular values. The variant loadings project your data into a low-dimensional space whose dimension is the number of principal components, a parameter you control. The sample scores are the locations of your samples in that same low-dimensional space.
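To make the three pieces concrete, here is a minimal NumPy sketch of PCA via a singular value decomposition. It is not how Hail implements PCA; the function name `pca` and the toy random genotype matrix are made up for illustration, and the data is assumed to have no missing values.

```python
import numpy as np

def pca(matrix, k):
    """Factor a (variants x samples) matrix into loadings, scores, singular values."""
    # Center each variant (row) so the components describe variation, not the mean.
    centered = matrix - matrix.mean(axis=1, keepdims=True)
    # Thin SVD: centered = U @ diag(S) @ Vt
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    loadings = u[:, :k]                    # one row per variant, one column per PC
    singular_values = s[:k]                # sorted from largest to smallest
    scores = (vt[:k, :] * s[:k, None]).T   # one row per sample, one column per PC
    return loadings, scores, singular_values

# Toy data: 50 variants x 8 samples of alt-allele counts (0, 1, or 2).
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(50, 8)).astype(float)
loadings, scores, singular_values = pca(genotypes, k=2)
```

The scores are exactly what you get by multiplying the centered data by the loadings, which is why loadings alone are enough to place new samples in the same space.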
If you have variant loadings (whether from PCA or some other dimensionality reduction technique), you can “project” (or “score”) your samples. The projection sends your samples from genome-space to the lower dimensional space.
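As a rough sketch of what projection means, the NumPy code below computes loadings from a toy reference panel and then scores new samples against them. The helper `project` and all the data are hypothetical. It assumes the new samples are typed at the same variants, in the same order, as the reference, and that they are centered with the reference means rather than their own.

```python
import numpy as np

def project(genotypes, loadings, variant_means):
    """Project (variants x samples) genotypes onto precomputed variant loadings."""
    # Center with the *reference* means so new samples land in the same space.
    centered = genotypes - variant_means[:, None]
    return centered.T @ loadings  # (samples x PCs)

rng = np.random.default_rng(1)

# Toy reference panel: 40 variants x 10 samples.
reference = rng.integers(0, 3, size=(40, 10)).astype(float)
variant_means = reference.mean(axis=1)
u, s, vt = np.linalg.svd(reference - variant_means[:, None], full_matrices=False)
loadings = u[:, :3]  # keep 3 principal components

# Toy "own" dataset: 4 samples at the same 40 variants.
new_samples = rng.integers(0, 3, size=(40, 4)).astype(float)
new_scores = project(new_samples, loadings, variant_means)
```

Projecting the reference panel itself reproduces its PCA scores, which is a handy sanity check before projecting anything else.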
If you have some points in low dimensional space, you can cluster those points into groups.
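For illustration, here is a bare-bones k-means (Lloyd's algorithm) in NumPy run on two toy blobs of points. K-means is just one clustering method I picked as an example, not a recommendation; the function `kmeans` and the data are made up, and in practice an existing clustering library is the better choice.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Cluster points (n x d) into k groups; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid: shape (n, k).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if it has none).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs in a 2-dimensional PC space.
rng = np.random.default_rng(2)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(20, 2))
blob_b = rng.normal(loc=10.0, scale=0.1, size=(20, 2))
points = np.vstack([blob_a, blob_b])
labels, centroids = kmeans(points, k=2)
```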
If you have clusters, you can assign points to clusters.
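The simplest version of assignment is "nearest centroid": give each sample the label of the closest cluster center. The sketch below assumes you already have one centroid per population in PC space; the centroid coordinates and the names `pop_A`, `pop_B`, `pop_C` are hypothetical placeholders, not real 1KG populations.

```python
import numpy as np

def assign_to_nearest(points, centroids, names):
    """Label each point (n x d) with the name of its nearest centroid (k x d)."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return [names[i] for i in distances.argmin(axis=1)]

# Hypothetical cluster centers in a 2-dimensional PC space.
centroids = np.array([[0.0, 0.0], [5.0, 5.0], [-4.0, 6.0]])
names = ["pop_A", "pop_B", "pop_C"]

# Projected samples to classify.
samples = np.array([[0.3, -0.2], [4.6, 5.5], [-3.0, 5.0]])
assigned = assign_to_nearest(samples, centroids, names)
# assigned == ["pop_A", "pop_B", "pop_C"]
```

Nearest-centroid ignores cluster shape and gives no measure of confidence, so samples that fall between populations still get a hard label.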
If you want to run PCA in Hail, you can use hl.pca. However, it’s important to prepare your data properly for PCA. Hail provides a function that includes some of the necessary preparation for genetics data:
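To give a sense of what "preparation" means here, the NumPy sketch below applies the kind of Hardy-Weinberg normalization commonly used before genotype PCA: drop monomorphic variants, center each variant at twice its alt allele frequency, and divide by the expected standard deviation. I believe Hail's hl.hwe_normalized_pca does something along these lines (possibly with an extra constant scale factor), but treat that as an assumption and check the Hail docs. The helper `hwe_normalize` is hypothetical and assumes no missing genotypes.

```python
import numpy as np

def hwe_normalize(n_alt):
    """Normalize a (variants x samples) matrix of alt-allele counts (0, 1, 2).

    Returns the normalized matrix and a boolean mask of the variants kept.
    """
    p = n_alt.mean(axis=1) / 2.0            # alt allele frequency per variant
    keep = (p > 0.0) & (p < 1.0)            # monomorphic variants carry no signal
    p = p[keep]
    centered = n_alt[keep] - 2.0 * p[:, None]
    scale = np.sqrt(2.0 * p * (1.0 - p))    # binomial(2, p) standard deviation
    return centered / scale[:, None], keep

n_alt = np.array([
    [0, 1, 2, 1],
    [0, 0, 0, 0],   # monomorphic, will be dropped
    [2, 2, 1, 1],
    [0, 1, 0, 1],
], dtype=float)
normalized, keep = hwe_normalize(n_alt)
```

Other preparation steps, such as filtering on call rate or pruning variants in linkage disequilibrium, are not shown here.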
If you have variant loadings, you can project your dataset using the gnomAD project’s
Hail doesn’t have native functionality for clustering, but you can use existing Python clustering tools on the projected dataset because the projected data is small. You’ll want to export your data to a file and read it in with
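As a stand-in for that round trip, here is a sketch using only Python's standard library csv module. The file layout (a tab-separated file with columns `s`, `PC1`, `PC2`) and the sample IDs are hypothetical, and the writing half merely imitates whatever your Hail export produces.

```python
import csv
import os
import tempfile

# Imitate an exported scores file; in practice this file would come from Hail.
rows = [("sample_1", 0.12, -0.40), ("sample_2", -0.33, 0.25)]
path = os.path.join(tempfile.mkdtemp(), "scores.tsv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["s", "PC1", "PC2"])
    writer.writerows(rows)

def read_scores(path):
    """Read a TSV of per-sample PC scores into {sample_id: [PC1, PC2]}."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        return {r["s"]: [float(r["PC1"]), float(r["PC2"])] for r in reader}

scores = read_scores(path)
```

From there the scores can be handed to whichever clustering tool you prefer.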
Likewise, you can assign clusters to your samples in the low dimensional space using Python libraries like
I’ll let others comment on whether you can use existing 1KG PCA variant loadings or need to run PCA on a combined dataset including 1KG and your dataset.