Understanding how hwe_normalized_pca works

Guntaas · March 20, 2024, 10:18pm

Hi,

I’m wondering if anyone could please share or point me in the right direction regarding understanding how hwe_normalized_pca works? I’ve used it alongside my own method of preparing data for a PCA and wanted to understand it in greater detail, to better interpret the clustering I get with hwe_normalized_pca. I’m particularly interested in knowing:

1. How does hwe_normalized_pca convert genotype calls such as “0/0”, “1/0”, etc. into numeric values that can be used as input to a PCA?
2. What quality control thresholds does it set for rows (variants) and columns (individuals)?
3. How does it impute missing values?

I processed my data for hl.pca by setting all GT calls with a GQ value below 20 to NA and removing rows where over 10% of the GT calls are NA. I converted GT calls into numbers, such as 0/0 → 0, 1/0 or 0/1 → 10, 1/1 → 11, 1/2 or 2/1 → 21, 9/15 → 159, etc., essentially concatenating the two sides of the “/” with the bigger number coming first each time. Finally, I set the missing values in each row (variant) to the mean of the defined values in that row. I then centered the (now numeric) GT entries of each variant by subtracting that variant's mean, and standardized them by dividing by that variant's standard deviation. I’m really curious how this process differs from hwe_normalized_pca. Thank you!
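For concreteness, the encoding and per-variant standardization described above could be sketched roughly as follows. This is only an illustration of the description in this post: the function names are hypothetical, missing calls are represented as None, and the population standard deviation (NumPy's default) is an assumption, since the post does not say which one was used.

```python
import numpy as np

def concat_encode(call):
    """Encode a call like "1/0" by concatenating its two allele
    indices, larger first ("9/15" -> 159). None means missing."""
    if call is None:
        return np.nan
    hi, lo = sorted((int(a) for a in call.split("/")), reverse=True)
    return float(f"{hi}{lo}")

def standardize_row(calls):
    """Mean-impute, center, and scale the calls of one variant."""
    x = np.array([concat_encode(c) for c in calls])
    # Replace missing entries with the mean of the defined ones.
    x = np.where(np.isnan(x), np.nanmean(x), x)
    # Assumption: population standard deviation (ddof=0).
    # A variant with identical calls everywhere would divide by zero here.
    return (x - x.mean()) / x.std()

row = standardize_row(["0/0", "1/0", None, "1/1"])
```

Note that under this encoding the numeric distance between calls depends on the allele indices themselves (0, 10, 11, 21, 159, ...) rather than on a count of alleles.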

patrick-schultz · March 25, 2024, 1:31pm

Hi @Guntaas,

Have you looked at the docs for hwe_normalized_pca? They cover 1. and 2. pretty well. In summary:

1. It normalizes the count of alternate alleles. It centers each variant by its empirical mean (which is twice the allele frequency), but instead of scaling by the empirical variance, it scales by the variance expected under the Hardy-Weinberg equilibrium model. This is the standard practice for PCA on genotypic data.
2. Basically none. It filters out monomorphic sites, as they can’t be normalized, but otherwise you should do all your QC before running PCA.
3. It mean imputes missing genotypes. Since the genotypes have been centered, that means it replaces missing genotypes with 0.

For complete details, see the method hwe_normalize here. It’s pretty simple.
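To make the three points above concrete, here is a rough NumPy sketch of the same idea. It is not Hail's code: the function name and the variants-by-samples array layout are assumptions for illustration, and the extra division by the square root of the number of variants follows my reading of the hwe_normalized_pca docs, so check it against the source before relying on it.

```python
import numpy as np

def hwe_normalize_sketch(alt_counts):
    """alt_counts: (n_variants, n_samples) array of alternate allele
    counts (0, 1, or 2), with np.nan for missing genotypes.
    Returns the normalized matrix and a mask of the variants kept."""
    g = np.asarray(alt_counts, dtype=float)
    n_called = (~np.isnan(g)).sum(axis=1)
    ac = np.nansum(g, axis=1)  # alternate allele count per variant

    # Drop monomorphic variants (all-ref or all-alt among called
    # genotypes); this also drops variants with no called genotypes.
    keep = (ac > 0) & (ac < 2 * n_called)
    g, n_called, ac = g[keep], n_called[keep], ac[keep]
    n_variants = g.shape[0]

    mean = ac / n_called  # empirical mean, equal to 2 * p
    # HWE variance 2p(1-p) = mean * (2 - mean) / 2, times n_variants
    # (assumed scaling, per the docs as I understand them).
    sd = np.sqrt(mean * (2.0 - mean) * n_variants / 2.0)

    z = (g - mean[:, None]) / sd[:, None]
    # Mean imputation: after centering, the mean is 0.
    return np.where(np.isnan(z), 0.0, z), keep

demo = np.array([
    [0, 1, 2, np.nan],   # polymorphic, one missing call
    [0, 0, 0, 0],        # monomorphic, filtered out
    [1, 1, np.nan, 2],
])
z, keep = hwe_normalize_sketch(demo)
```

The main contrast with a plain z-score is the denominator: it comes from the allele frequency alone rather than from the observed spread of the genotypes.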

Guntaas · March 26, 2024, 5:00pm

Thank you!
