Save PCs for projection

bavila · October 6, 2016, 7:01pm

I want to save the PCs from a (reference) set I run a PCA on, then project another set of samples onto that PC space.

jbloom · October 10, 2016, 5:36am

Thanks for the request! In fact, you can do this already using aggregables in the Hail expression language. Run the following from the cloned hail directory to demonstrate sample projection using 3 reference panel loading PCs, linear regression using the resulting score PCs as covariates, and computation of polygenic risk scores.

mkdir example

# assign samples at random to ref panel or study group
hail \
  importvcf src/test/resources/sample.vcf \
  splitmulti \
  annotatesamples expr -c 'sa.inRefPanel = pcoin(.5)' \
  write -o example/sample.vds

hail \
  read -i example/sample.vds \
  filtersamples expr -c 'sa.inRefPanel' --keep \
  write -o example/refPanel.vds

hail \
  read -i example/sample.vds \
  filtersamples expr -c 'sa.inRefPanel' --remove \
  write -o example/study.vds

# compute and export AF and PC loadings for reference panel
hail \
  read -i example/refPanel.vds \
  variantqc \
  exportvariants -c 'Variant = v, refPanelAF = va.qc.AF' \
    -o example/refPanelAF.tsv \
  pca -k 3 \
    -o example/refPanel.scores.tsv \
    -l example/refPanel.loadings.tsv \
    -e example/refPanel.evals.tsv

# annotate variants with AF and PC loadings from reference panel
# combine loadings into array annotation
# annotate samples with PC scores, random sex, random case-control status
# run linear regression of case-control status against sex and PC scores
# compute and export polygenic risk
hail \
  read -i example/study.vds \
  variantqc \
  annotatevariants table example/refPanelAF.tsv \
    -e Variant \
    -c 'va.refPanelAF = table.refPanelAF' \
    --impute \
  annotatevariants table example/refPanel.loadings.tsv \
    -e 'Variant(chrom, pos, ref, alt)' \
    -t 'chrom: String' \
    -c 'va = merge(va, select(table, PC1, PC2, PC3))' \
    --impute \
  annotatevariants expr -c 'va.PCs = [va.PC1, va.PC2, va.PC3]' \
  annotatesamples expr -c 'sa.PCs = gs.map(g => let p = va.refPanelAF in if (p == 0 || p == 1) [0.0, 0.0, 0.0] else (g.gt - 2 * p) / sqrt(2 * p * (1 - p)) * va.PCs).sum(), sa.isFemale = pcoin(.5), sa.isCase = pcoin(.5)' \
  linreg \
    -y sa.isCase \
    -c sa.isFemale,sa.PCs[0],sa.PCs[1],sa.PCs[2] \
  exportsamples -c 'Sample = s, polyRisk = gs.map(g => g.gt.toDouble.orElse(2 * va.qc.AF) * va.linreg.beta).sum()' \
    -o example/polyRisk.tsv \
  printschema \
  write -o example/study.pcs.linreg.risk.vds

As described in the PCA docs, for each sample, the projection is given by the product of the 1 x m vector of standardized genotypes from the study and the m x 3 matrix of PC loadings from the reference panel:

'sa.PCs = gs.map(g => let p = va.refPanelAF in if (p == 0 || p == 1) [0.0, 0.0, 0.0] else (g.gt - 2 * p) / (2 * p * (1 - p)) * va.PCs).sum()'

Following linear regression and the risk computation, the key elements of the resulting annotation schema in study.pcs.linreg.risk.vds are:

Sample annotation schema:
sa: Struct {
    ...
    PCs: Array[Double],
    isFemale: Boolean,
    isCase: Boolean
}

Variant annotation schema:
va: Struct {
    ...
    PCs: Array[Double],
    linreg: Struct {
        beta: Double,
        se: Double,
        tstat: Double,
        pval: Double
    }

Note that polygenic risk does not appear as a sample annotation because we directly exported the result of the computation; we could have instead annotated by polygenic risk and then export of the resulting annotation. We have several changes coming that should improve this workflow and will update this thread accordingly.

bavila · October 12, 2016, 5:09pm

This works well. I think the genotypes should be normalized as
(g.gt - 2 * p) / sqrt(2 * p * (1 - p))
rather than
(g.gt - 2 * p) / (2 * p * (1 - p))
and then optionally scaled.

jbloom · October 14, 2016, 12:40am

Yes, made the change. And more are coming since we just changed PCA to directly annotate rather than export.

Topic		Replies	Views
PCA Projection onto existing PCA Hail Query & hailctl	5	292	September 22, 2023
Running Hail on Databricks Help [0.1]	5	1341	March 29, 2017
Selectively recompute p-values with different regressions Hail Query & hailctl	1	436	January 7, 2021
Generate synthetic datasets with the Balding-Nichols model Updates	0	939	February 2, 2017
Log of breaking changes in 0.2 beta Updates	17	3101	August 15, 2018

Save PCs for projection

Related Topics