Hi,
Yes, sure. Below is the code I'm using for this analysis. It breaks down into two main parts: i) PCA in Hail and ii) training the RF model and predicting in R. Here I'm using an approach we already developed for a previous work (https://doi.org/10.1371/journal.pone.0189875), which combines recursive feature elimination (here, over the PCs) and RF. Ideally, it should be possible to do everything in Python for easier integration, but that requires extra work that I'm planning to do in the near future (see the rough sketch at the end of this post).
i) PCA in Hail
# read the 1000 Genomes Phase 3 VDS
vds_1kg = hc.read('/path/1000G_phase3/vds/vds_1kg_v3.vds')
# Prepare the 1000 Genomes data for PCA
# filtering
vds_filtered_1kg = vds_1kg.filter_multi() # filter out multi-allelic variants.
vds_filtered_1kg = vds_filtered_1kg.filter_variants_expr('v.altAllele().isSNP()', keep=True) # keep only SNPs
# read interval table (Purcell 5k)
purcell5k = KeyTable.import_bed('/path/1000G_phase3/purcell5k_intervals.bed')
vds_filtered_1kg = vds_filtered_1kg.\
    filter_variants_table(purcell5k.key_by('interval'), keep=True) # keep only variants in the Purcell intervals
# Read target VDS
vds = hc.read('/path/wes_vcf_merged.vds')
# Filtering
vds = vds.filter_multi() # filter out multi-allelic variants.
vds = vds.filter_variants_expr('v.altAllele().isSNP()', keep=True) # keep only SNPs
# Merge the datasets on overlapping SNPs (inner join)
vds_merged = vds.join(vds_filtered_1kg)
# computing variant QC on merged VDS
vds_merged = vds_merged.variant_qc().cache()
# Basic variant filtering before PCA
# Keep common SNPs (AF > 1%)
# LD pruning
common_vds = (vds_merged
    .filter_variants_expr('va.qc.AF > 0.01')
    .ld_prune(memory_per_core=512, num_cores=16))
# Perform PCA on the merged VDS (first 10 PCs)
vds_pca = common_vds.pca('sa.pca', k=10, eigenvalues='global.eigen')
# Getting sample annotations as Pandas dataframe
pca_table = vds_pca.samples_table().to_pandas()
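To move the table from the Python side into R, one option (not shown in the original workflow; the path and filename are just illustrative) is to export it as a TSV:
# export the PCs plus sample annotations for the R step below
pca_table.to_csv('/path/pca_table.tsv', sep='\t', index=False)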
ii) Training/testing RF classifier and prediction
library(caret)
######################################################################################
# This script can be used to train a Random Forest classifier and predict
# population ancestry using PCs as features. It has been tested after running PCA on ~5,000
# polymorphic SNPs (Purcell intervals) on two 'merged' datasets: 1) 1000 Genomes
# (used as the reference, since it contains sample population annotations) and 2) the
# target dataset (i.e. the dataset whose sample populations are unknown). Both datasets
# were merged before running the PCA, so the PCs can be used as features for training and prediction.
######################################################################################
# data frame with samples, populations and PCs (i.e. the first 10 PCs) from the Hail workflow,
# here loaded from the TSV exported at the end of part i)
pca_wes10k <- read.delim('/path/pca_table.tsv')
# getting subset with known population info
training <- subset(pca_wes10k, sa.SuperPopulation %in% c("AFR","AMR","EAS","EUR","SAS"))
# Getting subset with unknown population information
# we are going to predict ancestry on this subset...
discovery <- subset(pca_wes10k, !sa.SuperPopulation %in% c("AFR","AMR","EAS","EUR","SAS"))
# Extract training features (principal components)
features <- as.matrix(training[,c(6:15)])
# Extract response variable (Population info)
class <- as.vector(training$sa.SuperPopulation)
# Warning: if you are going to scale your data,
# do it before splitting it; otherwise you will
# introduce a bias in the prediction phase, since
# you would be using 'different' features for training the model
# and for predicting new instances!
# Scale data features
# features <- scale(features, center=TRUE, scale=TRUE)
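# An equivalent way to keep train/test features consistent (illustration only,
# not used here) is to estimate the centering/scaling parameters on the training
# split and reuse them on the test split inside the loop below:
# scaledTrain <- scale(trainDescr, center = TRUE, scale = TRUE)
# scaledTest <- scale(testDescr,
#                     center = attr(scaledTrain, "scaled:center"),
#                     scale = attr(scaledTrain, "scaled:scale"))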
# define some workflow metrics
n <- 10 # number of times to repeat the entire workflow (on randomized train/test splits)
nVars <- vector() # number of features in the final model
accv <- vector() # prediction accuracy per run
for (i in seq(1, n, 1)) {
  # Divide the dataset into train and test sets
  inTrain <- createDataPartition(as.factor(class), p = 2/3, list = FALSE)
  # Create the training dataset
  trainDescr <- features[inTrain,]
  # Create the testing dataset
  testDescr <- features[-inTrain,]
  # create the training class subset
  trainClass <- class[inTrain]
  # create the testing class subset
  testClass <- class[-inTrain]
  #### recursive feature elimination plus random forest
  rfProfile <- rfe(x = trainDescr,
                   y = as.factor(trainClass),
                   maximize = TRUE,
                   metric = 'Accuracy',
                   sizes = c(1:10),
                   rfeControl = rfeControl(functions = rfFuncs,
                                           method = "cv",
                                           number = 10,
                                           verbose = TRUE))
  ## predict the response variable (class) for all test samples with the new model;
  ## for classification, predict.rfe returns a data frame with the predicted
  ## class ('pred') plus one probability column per class
  predictedClass <- predict(rfProfile, newdata = testDescr)
  ## compute prediction accuracy on the predicted class column
  accTable <- postResample(predictedClass$pred, as.factor(testClass))
  accv[i] <- accTable[['Accuracy']]
  ## keep the current model if its accuracy ties or beats the best so far
  if (accv[i] >= max(accv)) {
    bestModel <- rfProfile
  }
  ## retrieve the number of variables used in the final model
  nVar <- rfProfile$optsize
  nVars[i] <- nVar
}
## summary metrics
nVar_mean <- mean(nVars)
acc_mean <- mean(accv)
acc_sd <- sd(accv)
# Data frame with performance (i.e. Accuracy) for each number of variables
results <- bestModel$results
# plotting/saving RFE object
png("performance_vs_variables_RFE_RF.png", width = 800, height = 800)
plot(bestModel, xlab = 'Number of variables')
dev.off()
# Predict ancestry/population from discovery subset using the 'best classifier'
# Extract features (PCs) from discovery subset
unknown <- as.matrix(discovery[,c(6:15)])
# predict new Populations for unknown instances/samples
predicted.ancestry <- predict(bestModel, newdata = unknown)
# rename columns
names(predicted.ancestry) <- c('predictedPopulation',
'probability_AFR',
'probability_AMR',
'probability_EAS',
'probability_EUR',
'probability_SAS')
# add the predicted population to the original data;
# merge by row name here, since row names are preserved from the original data
merged <- merge(discovery, predicted.ancestry, by=0, all = T)
# save file with sample population predicted
write.table(merged, file = 'population_predicted_random_forest.txt', row.names = FALSE, sep = '\t')
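As for the Python port mentioned at the top: a rough, untested sketch of part ii) using scikit-learn could look like the following. RFECV here plays the role of caret's rfe with cross-validation; the column names ('sa.SuperPopulation', 'sa.pca.PC1' ... 'sa.pca.PC10') are assumptions based on the annotation roots used in part i), and the repeated-split loop is collapsed into a single cross-validated run for brevity.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# table exported at the end of part i)
pca_table = pd.read_csv('/path/pca_table.tsv', sep='\t')
populations = ['AFR', 'AMR', 'EAS', 'EUR', 'SAS']
pcs = ['sa.pca.PC{}'.format(i) for i in range(1, 11)]  # assumed column names

known = pca_table[pca_table['sa.SuperPopulation'].isin(populations)]
unknown = pca_table[~pca_table['sa.SuperPopulation'].isin(populations)]

# recursive feature elimination over the PCs with 10-fold CV,
# ranking features by random forest importances
selector = RFECV(estimator=RandomForestClassifier(n_estimators=500, random_state=0),
                 cv=StratifiedKFold(n_splits=10),
                 scoring='accuracy')
selector.fit(known[pcs].values, known['sa.SuperPopulation'].values)

# predict ancestry for the samples with unknown population
predicted = selector.predict(unknown[pcs].values)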