Save PCs for projection

I want to save the PCs from a (reference) set I run a PCA on, then project another set of samples onto that PC space.

1 Like

Thanks for the request! In fact, you can do this already using aggregables in the Hail expression language. Run the following from the cloned hail directory to demonstrate sample projection using 3 reference panel loading PCs, linear regression using the resulting score PCs as covariates, and computation of polygenic risk scores.

mkdir example

# assign samples at random to ref panel or study group
hail \
  importvcf src/test/resources/sample.vcf \
  splitmulti \
  annotatesamples expr -c 'sa.inRefPanel = pcoin(.5)' \
  write -o example/sample.vds

hail \
  read -i example/sample.vds \
  filtersamples expr -c 'sa.inRefPanel' --keep \
  write -o example/refPanel.vds

hail \
  read -i example/sample.vds \
  filtersamples expr -c 'sa.inRefPanel' --remove \
  write -o example/study.vds

# compute and export AF and PC loadings for reference panel
hail \
  read -i example/refPanel.vds \
  variantqc \
  exportvariants -c 'Variant = v, refPanelAF = va.qc.AF' \
    -o example/refPanelAF.tsv \
  pca -k 3 \
    -o example/refPanel.scores.tsv \
    -l example/refPanel.loadings.tsv \
    -e example/refPanel.evals.tsv

# annotate variants with AF and PC loadings from reference panel
# combine loadings into array annotation
# annotate samples with PC scores, random sex, random case-control status
# run linear regression of case-control status against sex and PC scores
# compute and export polygenic risk
hail \
  read -i example/study.vds \
  variantqc \
  annotatevariants table example/refPanelAF.tsv \
    -e Variant \
    -c 'va.refPanelAF = table.refPanelAF' \
    --impute \
  annotatevariants table example/refPanel.loadings.tsv \
    -e 'Variant(chrom, pos, ref, alt)' \
    -t 'chrom: String' \
    -c 'va = merge(va, select(table, PC1, PC2, PC3))' \
    --impute \
  annotatevariants expr -c 'va.PCs = [va.PC1, va.PC2, va.PC3]' \
  annotatesamples expr -c 'sa.PCs = gs.map(g => let p = va.refPanelAF in if (p == 0 || p == 1) [0.0, 0.0, 0.0] else (g.gt - 2 * p) / sqrt(2 * p * (1 - p)) * va.PCs).sum(), sa.isFemale = pcoin(.5), sa.isCase = pcoin(.5)' \
  linreg \
    -y sa.isCase \
    -c sa.isFemale,sa.PCs[0],sa.PCs[1],sa.PCs[2] \
  exportsamples -c 'Sample = s, polyRisk = gs.map(g => g.gt.toDouble.orElse(2 * va.qc.AF) * va.linreg.beta).sum()' \
    -o example/polyRisk.tsv \
  printschema \
  write -o example/study.pcs.linreg.risk.vds

As described in the PCA docs, for each sample, the projection is given by the product of the 1 x m vector of standardized genotypes from the study and the m x 3 matrix of PC loadings from the reference panel:

'sa.PCs = gs.map(g => let p = va.refPanelAF in if (p == 0 || p == 1) [0.0, 0.0, 0.0] else (g.gt - 2 * p) / (2 * p * (1 - p)) * va.PCs).sum()'

Following linear regression and the risk computation, the key elements of the resulting annotation schema in study.pcs.linreg.risk.vds are:

Sample annotation schema:
sa: Struct {
    ...
    PCs: Array[Double],
    isFemale: Boolean,
    isCase: Boolean
}

Variant annotation schema:
va: Struct {
    ...
    PCs: Array[Double],
    linreg: Struct {
        beta: Double,
        se: Double,
        tstat: Double,
        pval: Double
    }

Note that polygenic risk does not appear as a sample annotation because we directly exported the result of the computation; we could have instead annotated by polygenic risk and then export of the resulting annotation. We have several changes coming that should improve this workflow and will update this thread accordingly.

1 Like

This works well. I think the genotypes should be normalized as
(g.gt - 2 * p) / sqrt(2 * p * (1 - p))
rather than
(g.gt - 2 * p) / (2 * p * (1 - p))
and then optionally scaled.

Yes, made the change. And more are coming since we just changed PCA to directly annotate rather than export.