Thank you so much for the response. I utilized hl.hwe_normalized_pca and achieved slightly different results in terms of the pca clustering (in a way that I would prefer). I’m wondering if you could perhaps explain or point me in the right direction regarding how hwe_normalized_pca normalizes the GT call data. This is my first time doing a PCA so I apologize for my lack of understand.
I’m wondering what I’m missing or how my method differs with hwe_normalized_pca. This is how I standardized the GT data so that it could be processed using hl.pca:
Convert entries with a genotype quality of below 20 to NA
mt1_filtered = mt1.annotate_entries(GQ=hl.if_else((mt1.GQ < 20), hl.missing(mt1.GQ.dtype), mt1.GQ))
Remove rows which have a genotype quality of NA in over 10% of columns
cond = hl.agg.count_where(hl.is_missing(mt1_filtered.GQ))
mt1_filtered2 = mt1_filtered.filter_rows((cond/mt1_filtered.count_cols()) <= 0.1)
Converting Call Expression to a Float
I then converted calls into numbers, such as 0/0 → 0 , 1/0 or 0/1 → 1, 1/1 → 1, 1/2 or 2/1 → 21 etc. pretty much appending the left side of the / with the right side of the / with the bigger number coming first each time. (I can share the code for clarity but I don’t want to waste your time)
I then converted every NA values remaining to 0 and calculated the mean of each row (with only the defined numbers, not including NA’s) and then set the NA’s in each respective row to the mean values of each row. (I can share the code for further clarity but I don’t want to waste your time)
The rest of the steps are standard and don’t seem like the cause of variation between my method and hl.hwe_normalized_pca:
I then calculated the mean of each row again
mt1_filtered3 = mt1_filtered3.annotate_rows(mean_GT= hl.agg.mean(mt1_filtered3.intermediary_GT2))
I then calculated the SD of each row
mt1_filtered3 = mt1_filtered3.annotate_rows(SD_GT= hl.sqrt(hl.agg.sum((mt1_filtered3.intermediary_GT2-
mt1_filtered3.mean_GT)**2)
/(mt1_filtered3.count_cols()-1)))
I calculated the final standardized numerical GT for each GT entry
mt1_filtered3 = mt1_filtered3.annotate_entries(final_GT = ((mt1_filtered3.intermediary_GT2-mt1_filtered3.mean_GT)
/ mt1_filtered3.SD_GT))
I then utilized this data for the pca
eigenvalues, scores, loadings = hl.pca(mt1_filtered3.final_GT, k=10, compute_loadings=True)