Hi,
Could anyone explain, or point me to documentation on, how hwe_normalized_pca works? I’ve run it alongside my own method of preparing data for a PCA, and I’d like to understand it in more detail so that I can better interpret the clustering I get with hwe_normalized_pca. I’m particularly interested in knowing:
- How does hwe_normalized_pca convert genotype calls such as “0/0”, “1/0”, etc. into numeric values that can be fed into a PCA?
- What quality-control thresholds does it apply to rows (variants) and columns (individuals)?
- How does it impute missing values?
For hl.pca, I prepared my data as follows. I set every GT call with a GQ value below 20 to NA and removed rows (variants) where more than 10% of the GT calls were NA. I then converted the GT calls into numbers by concatenating the two allele indices with the larger one first: 0/0 → 0, 1/0 or 0/1 → 10, 1/1 → 11, 1/2 or 2/1 → 21, 9/15 → 159, etc. Next, I set the missing values in each row (variant) to the mean of the defined values in that row. Finally, I centered the now-numeric GT entries of each variant on the mean of that variant’s entries and standardized them by dividing each entry by the standard deviation of that variant’s entries. How does this process differ from what hwe_normalized_pca does? Thank you!
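To make those steps concrete, here is a minimal plain-Python sketch of the preprocessing I described, applied to one variant at a time. It is only an illustration, not my actual Hail code. The names `encode_gt` and `prepare_row` are hypothetical, and several details are assumptions: the population standard deviation is used, the standard deviation is taken after imputation, and monomorphic rows are dropped to avoid dividing by zero.

```python
from statistics import mean, pstdev


def encode_gt(gt):
    """Encode 'a/b' as the integer formed by writing the larger allele index first."""
    a, b = (int(x) for x in gt.split("/"))
    hi, lo = max(a, b), min(a, b)
    return int(f"{hi}{lo}")  # e.g. '9/15' -> 159, '0/0' -> 0


def prepare_row(calls, gq_min=20, max_missing=0.10):
    """Prepare one variant (row) for PCA.

    calls: list of (gt, gq) pairs, one per individual; gt is a string such
    as '0/1', or None if the call is missing.
    Returns the centered, standardized values, or None if the row is dropped.
    """
    # Step 1: treat calls with GQ below the threshold as missing (NA).
    values = [
        encode_gt(gt) if gt is not None and gq is not None and gq >= gq_min else None
        for gt, gq in calls
    ]

    # Step 2: drop the row if more than 10% of its calls are missing.
    n_missing = sum(v is None for v in values)
    if n_missing / len(values) > max_missing:
        return None

    # Step 3: impute missing entries with the mean of the defined entries.
    row_mean = mean(v for v in values if v is not None)
    filled = [row_mean if v is None else v for v in values]

    # Step 4: center and standardize (population SD is an assumption).
    sd = pstdev(filled)
    if sd == 0:
        return None  # assumption: monomorphic rows are dropped
    return [(v - row_mean) / sd for v in filled]
```

With this encoding, a heterozygote (10) sits much closer to a homozygous-alternate call (11) than to a homozygous-reference call (0), so the resulting values are not a count of alternate alleles.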