PCA to output allele frequencies alongside loadings?

Title says most of it. I feel like hwe_normalized_pca could output the frequency it uses to center alongside the loadings. I'm doing this for downstream projection, which is presumably the main use case for outputting the loadings in the first place.

Right now my process looks like:

import hail as hl

eig, scores, loadings = hl.hwe_normalized_pca(mt.GT, compute_loadings=True)
mt = mt.annotate_rows(af=hl.agg.call_stats(mt.GT, mt.alleles).AF[1])
loadings = loadings.annotate(af=mt.rows()[loadings.key].af)

But this seems silly and wasteful.

Since you're likely filtering beforehand based on an AF annotation on the MatrixTable, I think it's reasonable to use the generic PCA functionality with the HWE-normalizing expression, so that the AF need not be recalculated, and then to annotate AF on the loadings from the rows of the MatrixTable. You'd modify this expression from the Python source:

    entry_expr = hl.bind(
        lambda mean_gt: hl.cond(hl.is_defined(dataset.GT),
                                (dataset.GT.n_alt_alleles() - mean_gt) /
                                hl.sqrt(mean_gt * (2 - mean_gt) * n_variants / 2),
                                0.0),
        dataset.AC / dataset.n_called)
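
Putting that together, here's a rough end-to-end sketch. It assumes AC and n_called row fields already exist (e.g. from a prior aggregation), restructures mean_gt as a row field to avoid the bind, and uses hl.if_else, the newer spelling of hl.cond:

    import hail as hl

    # mean genotype per variant, from the existing AC / n_called row fields
    dataset = dataset.annotate_rows(mean_gt=dataset.AC / dataset.n_called)
    n_variants = dataset.count_rows()

    # HWE-normalized entry expression, zeroing out missing genotypes
    entry_expr = hl.if_else(
        hl.is_defined(dataset.GT),
        (dataset.GT.n_alt_alleles() - dataset.mean_gt)
        / hl.sqrt(dataset.mean_gt * (2 - dataset.mean_gt) * n_variants / 2),
        0.0)

    eigenvalues, scores, loadings = hl.pca(entry_expr, k=10, compute_loadings=True)

    # AF is just mean_gt / 2, so it joins straight off the rows table
    loadings = loadings.annotate(af=dataset.rows()[loadings.key].mean_gt / 2)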

I’m less sure how the latter join will perform.

Or, if this is really a common use case after running hwe_normalized_pca, we could consider adding AF to the loadings as well, but only for that specialized PCA function.

Hi!

I am following up on the thread about this subject on the Hail chat: https://hail.zulipchat.com/#narrow/stream/123010-Hail-0.2E2.20support/topic/PCA.20projections.3F

Using the 1000 Genomes data and splitting it into two groups with the same samples but different variants:

pca_mt = mt.filter_rows(mt.variant_qc.AF[1] > 0.05)
related_mt = mt.filter_rows(mt.variant_qc.AF[1] <= 0.05)

Code from the chat mentioned above:

pca_evals, pca_scores, pca_loadings = hl.hwe_normalized_pca(pca_mt.GT, k=20, compute_loadings=True)
pca_mt = pca_mt.annotate_rows(pca_af=hl.agg.mean(pca_mt.GT.n_alt_alleles()) / 2)
pca_loadings = pca_loadings.annotate(pca_af=pca_mt[pca_loadings.key, :].pca_af)
related_scores = pc_project(related_mt, pca_loadings)

It goes wrong at:

pca_loadings = pca_loadings.annotate(pca_af=pca_mt[pca_loadings.key, :].pca_af)

Returning the error:
TypeError: MatrixTable.__getitem__: invalid index argument(s)
Usage 1: field selection ( mt['field'] )
Usage 2: Entry joining ( mt[mt2.row_key, mt2.col_key] )

Do you have any idea what is going wrong here?

This is pretty old: the mt[row keys, :] syntax is deprecated. Instead, use mt.rows()[row keys] or mt.index_rows(row keys).

We’ve updated pc_project to use the new syntax. The code lives here: https://github.com/macarthur-lab/gnomad_hail/blob/537cb9dd19c4a854a9ec7f29e552129081598399/utils/generic.py#L105

@tpoterba and @konradjk Great, many thanks!

pca_loadings = pca_loadings.annotate(pca_af=pca_mt.rows()[pca_loadings.locus, pca_loadings.alleles].pca_af)
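
Equivalently, since pca_loadings keeps the MatrixTable's row key, you can index with the table key directly (same join, just shorter):

pca_loadings = pca_loadings.annotate(pca_af=pca_mt.rows()[pca_loadings.key].pca_af)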

I also have a question about the pc_project() function, @konradjk: is it possible to use it (or something similar) to calculate the loadings (instead of the scores) for the mt containing the variants to project?

pc_project() is built to take loadings from PCA and calculate scores in order to project samples onto the PCs. However, here the matrix table is split into two MatrixTables with the same samples (scores) but different variants (loadings). I can thus directly use the scores calculated from pca_mt for related_mt, but the loadings need to be calculated.
I am quite new at this and hoped you would know how to solve it, or could refer me to comprehensive documentation about it?

If you have two separate tables with the same sample ids but disjoint variants, something like the following might be what you need.

I generate two random datasets, mt1 and mt2. The model generates the same loci for each dataset, so I shift the second dataset's locus positions so that the two datasets' loci are disjoint. I run PCA separately on each dataset (this might not be well founded from a statistical point of view, but it demonstrates having two separate sources of loadings), producing two tables of loadings, loadings1 and loadings2. I project with loadings1 first (nb: in reality you should mean center and variance normalize the GTs first, see pc_project line 132; a sketch of that normalization follows the transcript below). As you can see from the first show, the projected genotypes from the first dataset have real values, while the projected genotypes from the second dataset (position > 50) are NA. Finally, I project the remaining genotypes using loadings2.

Upon a second reading, I think I have not addressed your question directly! If you can clarify a bit, I can give a better example 🙂

In [28]: import hail as hl 
    ...:  
    ...: mt1 = hl.balding_nichols_model(3, 50, 50) 
    ...: mt2 = hl.balding_nichols_model(3, 50, 50) 
    ...: mt1.describe() 
    ...: mt2 = mt2.key_rows_by(locus=hl.locus(mt2.locus.contig, mt2.locus.position+50), alleles=mt2.alleles) 
    ...: _, _, loadings1 = hl.hwe_normalized_pca(mt1.GT, k=10, compute_loadings=True) 
    ...: _, _, loadings2 = hl.hwe_normalized_pca(mt2.GT, k=10, compute_loadings=True) 
    ...: union_mt = mt1.union_rows(mt2) 
    ...: union_mt.describe() 
    ...: loadings1.describe() 
    ...: union_mt = union_mt.annotate_entries(projected = loadings1[union_mt.row_key].loadings * union_mt.GT.n_alt_alleles()) 
    ...: union_mt.filter_rows(union_mt.locus.position > 45).show(n_cols=1) 
    ...: union_mt = union_mt.annotate_entries(projected = hl.or_else(union_mt.projected, loadings2[union_mt.row_key].loadings * union_mt.GT.n_alt_alleles()))
    ...: union_mt.filter_rows(union_mt.locus.position > 45).show(n_cols=1) 
    ...:                                                                                                                                     
2019-04-24 10:21:18 Hail: INFO: balding_nichols_model: generating genotypes for 3 populations, 50 samples, and 50 variants...
2019-04-24 10:21:18 Hail: INFO: balding_nichols_model: generating genotypes for 3 populations, 50 samples, and 50 variants...
----------------------------------------
Global fields:
    'bn': struct {
        n_populations: int32, 
        n_samples: int32, 
        n_variants: int32, 
        n_partitions: int32, 
        pop_dist: array<int32>, 
        fst: array<float64>, 
        mixture: bool
    }
----------------------------------------
Column fields:
    'sample_idx': int32
    'pop': int32
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'ancestral_af': float64
    'af': array<float64>
----------------------------------------
Entry fields:
    'GT': call
----------------------------------------
Column key: ['sample_idx']
Row key: ['locus', 'alleles']
----------------------------------------
2019-04-24 10:21:18 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:18 Hail: INFO: hwe_normalized_pca: running PCA using 50 variants.
2019-04-24 10:21:19 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:19 Hail: INFO: pca: running PCA with 10 components...
2019-04-24 10:21:19 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:19 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:19 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:19 Hail: INFO: hwe_normalized_pca: running PCA using 50 variants.
2019-04-24 10:21:20 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:20 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:20 Hail: INFO: pca: running PCA with 10 components...
2019-04-24 10:21:20 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:20 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:21 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:21 Hail: INFO: Coerced sorted dataset
----------------------------------------
Global fields:
    'bn': struct {
        n_populations: int32, 
        n_samples: int32, 
        n_variants: int32, 
        n_partitions: int32, 
        pop_dist: array<int32>, 
        fst: array<float64>, 
        mixture: bool
    }
----------------------------------------
Column fields:
    'sample_idx': int32
    'pop': int32
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'ancestral_af': float64
    'af': array<float64>
----------------------------------------
Entry fields:
    'GT': call
----------------------------------------
Column key: ['sample_idx']
Row key: ['locus', 'alleles']
----------------------------------------
----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'locus': locus<GRCh37> 
    'alleles': array<str> 
    'loadings': array<float64> 
----------------------------------------
Key: ['locus', 'alleles']
----------------------------------------
2019-04-24 10:21:21 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:21 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:21 Hail: INFO: Coerced sorted dataset
+---------------+------------+------+---------------------------------------------------------------------------------------------------+
| locus         | alleles    | 0.GT | 0.projected                                                                                       |
+---------------+------------+------+---------------------------------------------------------------------------------------------------+
| locus<GRCh37> | array<str> | call | array<float64>                                                                                    |
+---------------+------------+------+---------------------------------------------------------------------------------------------------+
| 1:46          | ["A","C"]  | 0/1  | [1.87e-02,-4.67e-02,-2.28e-01,-1.08e-01,-7.43e-02,9.89e-03,2.69e-02,-3.23e-01,1.42e-01,-7.83e-02] |
| 1:47          | ["A","C"]  | 0/1  | [5.09e-02,-8.10e-02,-2.04e-01,4.71e-02,7.70e-03,-8.92e-04,3.42e-03,-1.63e-01,8.47e-02,-2.53e-01]  |
| 1:48          | ["A","C"]  | 0/0  | [0.00e+00,-0.00e+00,0.00e+00,0.00e+00,-0.00e+00,0.00e+00,0.00e+00,-0.00e+00,-0.00e+00,-0.00e+00]  |
| 1:49          | ["A","C"]  | 0/0  | [-0.00e+00,0.00e+00,-0.00e+00,0.00e+00,-0.00e+00,0.00e+00,-0.00e+00,0.00e+00,-0.00e+00,0.00e+00]  |
| 1:50          | ["A","C"]  | 0/0  | [-0.00e+00,-0.00e+00,0.00e+00,0.00e+00,0.00e+00,0.00e+00,0.00e+00,-0.00e+00,0.00e+00,-0.00e+00]   |
| 1:51          | ["A","C"]  | 0/1  | NA                                                                                                |
| 1:52          | ["A","C"]  | 0/1  | NA                                                                                                |
| 1:53          | ["A","C"]  | 1/1  | NA                                                                                                |
| 1:54          | ["A","C"]  | 1/1  | NA                                                                                                |
| 1:55          | ["A","C"]  | 0/1  | NA                                                                                                |
| 1:56          | ["A","C"]  | 1/1  | NA                                                                                                |
| 1:57          | ["A","C"]  | 1/1  | NA                                                                                                |
+---------------+------------+------+---------------------------------------------------------------------------------------------------+
showing top 12 rows
showing the first 1 of 50 columns
2019-04-24 10:21:22 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:23 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:23 Hail: INFO: Coerced sorted dataset
+---------------+------------+------+-----------------------------------------------------------------------------------------------------+
| locus         | alleles    | 0.GT | 0.projected                                                                                         |
+---------------+------------+------+-----------------------------------------------------------------------------------------------------+
| locus<GRCh37> | array<str> | call | array<float64>                                                                                      |
+---------------+------------+------+-----------------------------------------------------------------------------------------------------+
| 1:46          | ["A","C"]  | 0/1  | [1.87e-02,-4.67e-02,-2.28e-01,-1.08e-01,-7.43e-02,9.89e-03,2.69e-02,-3.23e-01,1.42e-01,-7.83e-02]   |
| 1:47          | ["A","C"]  | 0/1  | [5.09e-02,-8.10e-02,-2.04e-01,4.71e-02,7.70e-03,-8.92e-04,3.42e-03,-1.63e-01,8.47e-02,-2.53e-01]    |
| 1:48          | ["A","C"]  | 0/0  | [0.00e+00,-0.00e+00,0.00e+00,0.00e+00,-0.00e+00,0.00e+00,0.00e+00,-0.00e+00,-0.00e+00,-0.00e+00]    |
| 1:49          | ["A","C"]  | 0/0  | [-0.00e+00,0.00e+00,-0.00e+00,0.00e+00,-0.00e+00,0.00e+00,-0.00e+00,0.00e+00,-0.00e+00,0.00e+00]    |
| 1:50          | ["A","C"]  | 0/0  | [-0.00e+00,-0.00e+00,0.00e+00,0.00e+00,0.00e+00,0.00e+00,0.00e+00,-0.00e+00,0.00e+00,-0.00e+00]     |
| 1:51          | ["A","C"]  | 0/1  | [-2.15e-01,-1.44e-01,-9.22e-02,-5.71e-02,1.75e-01,-1.55e-01,3.61e-02,-1.58e-01,-3.50e-01,-1.33e-02] |
| 1:52          | ["A","C"]  | 0/1  | [8.60e-02,1.95e-01,-1.44e-01,1.68e-01,-9.61e-02,8.25e-02,-1.30e-01,2.98e-01,-3.11e-02,7.92e-03]     |
| 1:53          | ["A","C"]  | 1/1  | [-2.80e-01,3.27e-02,-6.19e-02,-4.96e-01,-1.51e-01,-1.11e-01,-1.33e-02,1.94e-01,3.67e-01,1.92e-01]   |
| 1:54          | ["A","C"]  | 1/1  | [4.88e-03,3.18e-02,4.02e-01,-3.91e-02,-6.12e-02,3.79e-01,7.26e-02,6.10e-01,-5.12e-02,5.79e-02]      |
| 1:55          | ["A","C"]  | 0/1  | [-1.22e-01,-1.60e-01,2.32e-01,6.86e-02,-3.77e-02,3.49e-01,-7.54e-02,-1.42e-02,5.03e-02,-8.00e-02]   |
| 1:56          | ["A","C"]  | 1/1  | [3.86e-01,-1.29e-01,3.57e-01,-2.20e-01,7.39e-02,3.87e-01,-1.55e-01,-4.19e-01,3.56e-01,8.60e-02]     |
| 1:57          | ["A","C"]  | 1/1  | [-1.05e-01,-5.41e-01,-2.04e-01,-1.28e-01,-6.17e-02,1.75e-02,-2.59e-02,-3.15e-01,1.03e-01,-9.13e-02] |
+---------------+------------+------+-----------------------------------------------------------------------------------------------------+
showing top 12 rows
showing the first 1 of 50 columns

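Regarding the nb above about normalization: here is a minimal sketch of the mean centering and variance normalization that pc_project performs before the multiply, assuming the loadings table carries loadings and pca_af row fields (the function and field names are illustrative, not the gnomad API):

import hail as hl

def project_samples(mt, pca_loadings):
    # number of variants that contributed to the loadings
    n_variants = pca_loadings.count()
    # pull loadings and AF onto the rows of the dataset being projected
    mt = mt.annotate_rows(pca_loadings=pca_loadings[mt.row_key].loadings,
                          pca_af=pca_loadings[mt.row_key].pca_af)
    mt = mt.filter_rows(hl.is_defined(mt.pca_loadings) & hl.is_defined(mt.pca_af))
    # mean center and HWE variance normalize each genotype
    gt_norm = ((mt.GT.n_alt_alleles() - 2 * mt.pca_af)
               / hl.sqrt(n_variants * 2 * mt.pca_af * (1 - mt.pca_af)))
    # per sample, sum loading * normalized genotype over all variants
    mt = mt.annotate_cols(scores=hl.agg.array_sum(mt.pca_loadings * gt_norm))
    return mt.cols().select('scores')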