Understanding how hwe_normalized_pca works

Guntaas · March 20, 2024, 10:18pm

Hi,

I’m wondering if anyone could explain, or point me in the right direction on, how hwe_normalized_pca works? I’ve used it alongside my own method of preparing data for a PCA and want to understand it in greater detail, so I can better interpret the clustering I get with hwe_normalized_pca. I’m particularly interested in knowing:

1. How does hwe_normalized_pca convert genotype calls such as “0/0”, “1/0”, etc. into numeric values that can be fed into a PCA?
2. What quality control thresholds does it set for variants (rows) and individuals (columns)?
3. How does it impute missing values?

I processed my data for hl.pca by setting all GT calls with a GQ value below 20 to NA and removing rows where over 10% of the GT calls are NA. I converted GT calls into numbers, such as 0/0 → 0, 1/0 or 0/1 → 10, 1/1 → 11, 1/2 or 2/1 → 21, 9/15 → 159, etc. Essentially, I concatenated the number on the left of the “/” with the number on the right, with the bigger number coming first each time. Then I set the missing values in each row (variant) to the mean of the defined values in that row. Finally, I centered the (now numeric) GT entries of each variant by subtracting the mean of the entries for that variant, and standardized them by dividing by the standard deviation of the entries for that variant. I’m really curious how this process differs from hwe_normalized_pca. Thank you!
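
For readers following along, the encoding and standardization described above can be sketched in NumPy roughly as follows. This is only an illustration under assumptions, not the poster's actual code: the names `encode_gt` and `standardize_rows` are hypothetical, missing calls are represented as `None`, the population standard deviation is assumed, and the GQ and missingness filters are left out.

```python
import numpy as np

def encode_gt(gt):
    """Concatenate the two allele indices, larger first (hypothetical helper).

    '1/0' -> 10, '1/1' -> 11, '9/15' -> 159; a missing call (None) -> NaN.
    """
    if gt is None:
        return np.nan
    a, b = sorted((int(x) for x in gt.split("/")), reverse=True)
    return float(int(f"{a}{b}"))

def standardize_rows(calls):
    """calls: list of rows (variants), each a list of GT strings or None."""
    x = np.array([[encode_gt(g) for g in row] for row in calls], dtype=float)
    # Mean of the defined entries of each variant.
    row_mean = np.nanmean(x, axis=1, keepdims=True)
    # Mean-impute missing entries, which leaves the row mean unchanged.
    x = np.where(np.isnan(x), row_mean, x)
    # Assumption: population standard deviation (ddof=0) after imputation.
    row_sd = x.std(axis=1, keepdims=True)
    return (x - row_mean) / row_sd

z = standardize_rows([["0/0", "1/0", None, "1/1"]])
```

Note that a row whose entries are all identical has zero standard deviation, so this sketch would divide by zero there; such rows would need to be dropped first.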

patrick-schultz · March 25, 2024, 1:31pm

Hi @Guntaas,

Have you looked at the docs for hwe_normalized_pca? They cover 1. and 2. pretty well. In summary:

1. It converts each call to its count of alternate alleles and normalizes that count. Instead of using the empirical variance, it computes the empirical mean (which is twice the allele frequency) and uses the variance implied by the Hardy-Weinberg equilibrium model. This is the standard practice for PCA on genotypic data.
2. Basically none. It filters out monomorphic sites, as they can’t be normalized, but otherwise you should do all your QC before running PCA.
3. It mean imputes missing genotypes. Since the genotypes have been normalized, that means it replaces missing genotypes with 0.

For complete details, see the method hwe_normalize here. It’s pretty simple.
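
As a rough illustration of the three points above, here is a minimal NumPy sketch of the normalization. It is not Hail's code: the function name and the array layout (variants as rows, samples as columns, `np.nan` for a missing genotype) are assumptions made for the example, and the scaling follows the formula given in the hwe_normalized_pca docs, with p the alternate allele frequency and m the number of variants kept.

```python
import numpy as np

def hwe_normalize_sketch(n_alt):
    """Illustrative HWE normalization (hypothetical helper, not Hail's API).

    n_alt: variants x samples array of alternate-allele counts (0, 1, 2),
           with np.nan marking a missing genotype.
    """
    n_alt = np.asarray(n_alt, dtype=float)
    # Empirical mean over called genotypes; equals 2 * allele frequency.
    mean_gt = np.nanmean(n_alt, axis=1, keepdims=True)
    # Drop monomorphic sites (mean of 0 or 2), which cannot be normalized.
    keep = ((mean_gt > 0) & (mean_gt < 2)).ravel()
    n_alt, mean_gt = n_alt[keep], mean_gt[keep]
    n_variants = n_alt.shape[0]
    # HWE variance 2p(1-p) = mean_gt * (2 - mean_gt) / 2, scaled by the
    # number of remaining variants.
    hwe_sd = np.sqrt(mean_gt * (2 - mean_gt) * n_variants / 2)
    z = (n_alt - mean_gt) / hwe_sd
    # Mean imputation: after centering, the mean is 0.
    return np.where(np.isnan(z), 0.0, z)

genotypes = [
    [0, 1, 2, np.nan],   # polymorphic, one missing call
    [0, 0, 0, 0],        # monomorphic, gets dropped
    [2, 2, 1, 1],        # polymorphic
]
normalized = hwe_normalize_sketch(genotypes)
```

Two differences from the approach in the question stand out here: the input is the alternate-allele count rather than a concatenation of allele indices, and the divisor comes from the Hardy-Weinberg model rather than from the observed standard deviation of each row.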

Guntaas · March 26, 2024, 5:00pm

Thank you!
