Arrangement of the scores output for hl.pca

Guntaas · March 18, 2024, 1:07am

Hi,

I’m making a PCA where my rows are specific SNPs and columns are individuals, with the entries being the GT calls. I’ve modified and standardized the GT calls. I subsequently performed hl.pca to the ascertain eigenvalues, scores and loadings for the amount of components I chose (10).

Moving on towards my question, I’m seeking clarification regarding the arrangement of the scores in the scores list for each component, created using hl.pca. Are the scores (with each representing a column) organized in the same order as the original column arrangement or do they follow a numerical order in ascending fashion. To further explain what I mean, these two functions output columns in ascending numerical order and the original order respectively:

mt.cols().collect - creates a list with the columns arranged in ascending order
mt.col.collect - creates a list with the columns in the original order

I apologize for any confusion regarding my question, I wanted to know the answer to this so that I can annotate the points in my PCA accurately. Thank You!

patrick-schultz · March 18, 2024, 12:44pm

Hi @Guntaas,

For future reference, we have the method hwe_normalized_pca which handles the standardization and mean-centering for you, as well as edge-cases like filtering out monomorphic sites (which can’t be normalized).

The scores are returned as a table, keyed by your original column keys (e.g. sample id). So they are in ascending order of sample id, as all tables are ordered by key. But also, there shouldn’t be any ambiguity over how the scores map back to the samples.

For example, if you wanted to annotate your original matrixtable with the scores, you could use

eigvals, scores, _ = hl.hwe_normalized_pca(mt.GT)
mt = mt.annotate_cols(scores=scores[mt.col_key])

Does that answer your question?

Guntaas · March 18, 2024, 2:51pm

Thank you for informing me about hwe_normalized_pca.

The scores that are returned to me as a table do not have columns, so I am slightly confused by what you mean when you say that they are keyed by the original column key. The scores are returned as a list of scores in a specific order, which you informed me is in ascending order, corresponding to a specific row (with each row being a component of the PCA).

But your method of annotating the original matrix table answered my question. Thank you for the guidance and I apologize for my ignorance!

patrick-schultz · March 18, 2024, 2:57pm

You mentioned that mt.cols().collect creates a list with the columns arranged in ascending order. The reason is that mt.cols() creates a table whose rows correspond to the colums of mt, with a field for every column field of mt, and which is keyed by the column key of mt. And collect on a table creates a list of rows ordered by key.

The scores table returned by PCA is similar: the rows correspond to the columns of the original matrixtable, keyed by the original column key, which is what lets us annotate the scores back onto the matrixtable.

Hope that helps clarify things a bit!

Guntaas · March 18, 2024, 6:16pm

Thank you so much for the response. I utilized hl.hwe_normalized_pca and achieved slightly different results in terms of the pca clustering (in a way that I would prefer). I’m wondering if you could perhaps explain or point me in the right direction regarding how hwe_normalized_pca normalizes the GT call data. This is my first time doing a PCA so I apologize for my lack of understand.

I’m wondering what I’m missing or how my method differs with hwe_normalized_pca. This is how I standardized the GT data so that it could be processed using hl.pca:

Convert entries with a genotype quality of below 20 to NA

mt1_filtered = mt1.annotate_entries(GQ=hl.if_else((mt1.GQ < 20), hl.missing(mt1.GQ.dtype), mt1.GQ))

Remove rows which have a genotype quality of NA in over 10% of columns

cond = hl.agg.count_where(hl.is_missing(mt1_filtered.GQ))
mt1_filtered2 = mt1_filtered.filter_rows((cond/mt1_filtered.count_cols()) <= 0.1)

Converting Call Expression to a Float

I then converted calls into numbers, such as 0/0 → 0 , 1/0 or 0/1 → 1, 1/1 → 1, 1/2 or 2/1 → 21 etc. pretty much appending the left side of the / with the right side of the / with the bigger number coming first each time. (I can share the code for clarity but I don’t want to waste your time)

I then converted every NA values remaining to 0 and calculated the mean of each row (with only the defined numbers, not including NA’s) and then set the NA’s in each respective row to the mean values of each row. (I can share the code for further clarity but I don’t want to waste your time)

The rest of the steps are standard and don’t seem like the cause of variation between my method and hl.hwe_normalized_pca:

I then calculated the mean of each row again

mt1_filtered3 = mt1_filtered3.annotate_rows(mean_GT= hl.agg.mean(mt1_filtered3.intermediary_GT2))

I then calculated the SD of each row

mt1_filtered3 = mt1_filtered3.annotate_rows(SD_GT= hl.sqrt(hl.agg.sum((mt1_filtered3.intermediary_GT2-
mt1_filtered3.mean_GT)**2)
/(mt1_filtered3.count_cols()-1)))

I calculated the final standardized numerical GT for each GT entry

mt1_filtered3 = mt1_filtered3.annotate_entries(final_GT = ((mt1_filtered3.intermediary_GT2-mt1_filtered3.mean_GT)
/ mt1_filtered3.SD_GT))

I then utilized this data for the pca

eigenvalues, scores, loadings = hl.pca(mt1_filtered3.final_GT, k=10, compute_loadings=True)

Topic		Replies	Views
Question about Ordering unsorted dataset with network shuffle Hail Query & hailctl	0	128	May 16, 2024
Prepare hail entries for spark.ml Hail Query & hailctl	3	444	October 29, 2019
PCA plot for data compared with ancestry reference (1000 Genomes) Hail Query & hailctl	1	1073	October 15, 2019
PCA filtering samples? Hail Query & hailctl	2	397	April 14, 2020
Group by columns and aggregate entries over all entries in the group Hail Query & hailctl	2	440	August 30, 2021