Understanding how hwe_normalized_pca works

Hi,

I’m wondering if anyone could share, or point me to, an explanation of how hwe_normalized_pca works. I’ve used it alongside my own method of preparing data for a PCA and want to understand it in greater detail, so that I can better interpret the clustering I get with hwe_normalized_pca. I’m particularly interested in knowing:

  1. How does hwe_normalized_pca convert genotype calls such as “0/0”, “1/0”, etc. into numeric values that can be fed into a PCA?
  2. What quality control thresholds does it set for variants (rows) and individuals (columns)?
  3. How does it impute missing values?

I processed my data for hl.pca by setting all GT calls with a GQ value below 20 to NA and removing rows where over 10% of the GT calls are NA. I converted GT calls into numbers, such as 0/0 → 0, 1/0 or 0/1 → 10, 1/1 → 11, 1/2 or 2/1 → 21, 9/15 → 159, etc., essentially concatenating the numbers on either side of the “/”, with the bigger number coming first each time. Finally, I set the missing values in each row (variant) to the mean of the defined values in that row. I also centered the (now numeric) GT entries of each variant using the mean of the entries for that variant, and standardized them by dividing each entry by the standard deviation of the entries for that variant. I’m really curious how this process differs from hwe_normalized_pca. Thank you!
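For reference, the encoding and per-variant standardization described above can be sketched in plain Python. This is only an illustration of the description: the function names `encode_gt` and `standardize_row` are made up here, and the choice of the population standard deviation computed after imputation is an assumption, since the post does not specify it.

```python
from statistics import mean, pstdev

def encode_gt(gt):
    """Concatenate the two allele indices of 'a/b', larger first; None stays None."""
    if gt is None:
        return None
    a, b = sorted((int(x) for x in gt.split("/")), reverse=True)
    return int(f"{a}{b}")

def standardize_row(calls):
    """Encode one variant's calls, mean-impute missing values, then center and scale."""
    codes = [encode_gt(c) for c in calls]
    defined = [c for c in codes if c is not None]
    mu = mean(defined)                      # mean of the defined values only
    filled = [mu if c is None else c for c in codes]
    sd = pstdev(filled)                     # assumption: population SD, after imputation
    return [(x - mu) / sd for x in filled]  # fails if the row is constant (sd == 0)

row = ["0/0", "1/0", None, "1/1", "9/15"]
print([encode_gt(c) for c in row])
print(standardize_row(row))
```

Note that under this encoding the numeric distance between codes (for example 11 versus 21 versus 159) reflects allele labels rather than a count of alleles.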

Hi @Guntaas,

Have you looked at the docs for hwe_normalized_pca? They cover 1. and 2. pretty well. In summary:

  1. It normalizes the count of alternate alleles. Each variant is centered on its empirical mean (which is twice the allele frequency p), but instead of scaling by the empirical variance, it uses the variance predicted by the Hardy-Weinberg equilibrium model, 2p(1 − p). This is standard practice for PCA on genotype data.
  2. Basically none. It filters out monomorphic sites, as they can’t be normalized, but otherwise you should do all your QC before running PCA.
  3. It mean-imputes missing genotypes. Since the genotypes have already been centered by that point, this means it replaces missing genotypes with 0.

For complete details, see the method hwe_normalize here. It’s pretty simple.
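The three points above can be sketched in plain Python. This is a minimal illustration, not Hail's code: the function name `hwe_normalize_sketch` and the list-of-lists input are invented for the example, and the extra division by the number of retained variants follows my reading of the formula in the hwe_normalized_pca docs, so treat it as an assumption to check there.

```python
from math import sqrt

def hwe_normalize_sketch(rows):
    """rows: one list per variant of alt-allele counts (0, 1, 2), None if missing."""
    kept = []
    for row in rows:
        called = [g for g in row if g is not None]
        if not called:
            continue
        mean_gt = sum(called) / len(called)   # empirical mean = 2p
        if mean_gt == 0 or mean_gt == 2:      # monomorphic: cannot be normalized
            continue
        kept.append((row, mean_gt))

    m = len(kept)                             # number of variants retained
    out = []
    for row, mean_gt in kept:
        p = mean_gt / 2
        # HWE variance 2p(1-p); scaling by m is an assumption taken from the docs
        sd = sqrt(2 * p * (1 - p) * m)
        # missing genotypes become 0, i.e. the mean after centering
        out.append([0.0 if g is None else (g - mean_gt) / sd for g in row])
    return out

rows = [[0, 1, 2, None], [0, 0, 0, 0], [1, 1, None, 0]]
for normalized in hwe_normalize_sketch(rows):
    print(normalized)
```

The key contrast with a hand-rolled encoding is that the input here is the number of alternate alleles per call, so 0/1 and 1/0 both count as 1 and 1/1 counts as 2.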

Thank you!