PCA to output allele frequencies alongside loadings?

konradjk · March 12, 2018, 4:57pm

Title says most of it. I feel like hwe_normalized_pca could output the frequency it uses to center alongside the loadings. I’m doing this for downstream projection, which presumably is most of the use cases for outputting the loadings in the first place.

Right now my process looks like:

eig, scores, loadings = hl.hwe_normalized_pca(mt)
mt = mt.annotate(af=hl.agg.call_stats(mt.GT, mt.alleles).AF[1])
loadings = loadings.annotate(af=mt[loadings.locus])

But this seems silly and wasteful.

jbloom · March 12, 2018, 5:49pm

Since you’re likely filtering beforehand based on an AF annotation on the MatrixTable, I think it’s reasonable to use the generic PCA functionality with the HWE normalizing expression, so that the AF need not be recalculated, and then to annotate AF on the loadings from the rows of the MatrixTable. You’d modify this expression from the Python code:

    entry_expr = hl.bind(
        dataset.AC / dataset.n_called,
        lambda mean_gt: hl.cond(hl.is_defined(dataset.GT),
                                (dataset.GT.n_alt_alleles() - mean_gt) /
                                hl.sqrt(mean_gt * (2 - mean_gt) * n_variants / 2),
                                0))

I’m less sure how the latter join will perform.

jbloom · March 12, 2018, 5:59pm

Or if this is really a common use case after running hwe_normalized_pca, we could consider adding AF to loadings as well, but only for that specialized PCA function.

lindsayvm · April 22, 2019, 3:16pm

Hi!

I am following up on the thread about this subject on the hail chat: https://hail.zulipchat.com/#narrow/stream/123010-Hail-0.2E2.20support/topic/PCA.20projections.3F

Using the 1K genome data and splitting this into two groups of the same samples but different variants:

pca_mt = mt.filter_rows(mt.variant_qc.AF[1] > 0.05)
related_mt = mt.filter_rows(mt.variant_qc.AF[1] <= 0.05)

Code from the chat mentioned above:

pca_evals, pca_scores, pca_loadings = hl.hwe_normalized_pca(pca_mt.GT, k=20, compute_loadings=True)
pca_mt = pca_mt.annotate_rows(pca_af=hl.agg.mean(pca_mt.GT.n_alt_alleles()) / 2)
pca_loadings = pca_loadings.annotate(pca_af=pca_mt[pca_loadings.key, :].pca_af)
related_scores = pc_project(related_mt, pca_loadings)

it goes wrong at:

pca_loadings = pca_loadings.annotate(pca_af=pca_mt[pca_loadings.key, :].pca_af)

Returning the error:
TypeError: MatrixTable.getitem: invalid index argument(s)
Usage 1: field selection ( mt[‘field’] )
Usage 2: Entry joining ( mt[mt2.row_key, mt2.col_key] )

Do you have any idea what is going wrong here?

tpoterba · April 22, 2019, 4:03pm

this is pretty old – the mt[row keys, :] syntax is deprecated. Instead use mt.rows()[row keys] or mt.index_rows(row keys)

konradjk · April 22, 2019, 4:24pm

We’ve updated pc_project to use the new syntax. The code lives here: https://github.com/macarthur-lab/gnomad_hail/blob/537cb9dd19c4a854a9ec7f29e552129081598399/utils/generic.py#L105

lindsayvm · April 23, 2019, 6:20pm

@tpoterba and @konradjk Great, many thanks!

pca_loadings = pca_loadings.annotate(pca_af = pca_mt.rows()[pca_loadings.locus, pca_loadings.alleles].pca_af)

I also have a question about pc_project() function @konradjk , whether it is possible to use this (or something similar) to calculate the loadings (instead of the scores) for the mt containing the variants to project.

pc_project() is build to take loadings from pca, calculate scores in order to project samples onto PC. However, here the matrix table is splitted into two matrixtables with the same samples (scores), but different variants(loadings). I can thus directly use the scores calculated from the pca_mt for related_mt, but the loadings need to be calculated.
I am quit new at this and hoped you would know how to solve this or refer me to comprehensive documentation about this?

danking · April 24, 2019, 2:22pm

If you have two separate tables with the same sample ids but disjoint variants, something like the following might be what you need.

I generate two random datasets, mt1 and mt2. This model will generate the same loci for each dataset, so I shift the second dataset’s locus positions such that the datasets’ loci are disjoint. I run PCA separately for each dataset (this might not be well founded from a statistical point of view, but it demonstrates having two separate sources of loadings). This produces two tables of loadings, loadings1 and loadings2. I annotate the first loadings first (nb: in reality you should mean center and variance normalize the GTs first, see pc_project line 132). As you can see from the show the projected genotypes from the first dataset have real values and the projected genotypes from the second dataset (position>50) have NA. Finally, I project the remaining genotypes using loadings2.

Upon a second reading, I think I have not addressed your question directly! If you can clarify a bit I can give a better example .

In [28]: import hail as hl 
    ...:  
    ...: mt1 = hl.balding_nichols_model(3, 50, 50) 
    ...: mt2 = hl.balding_nichols_model(3, 50, 50) 
    ...: mt1.describe() 
    ...: mt2 = mt2.key_rows_by(locus=hl.locus(mt2.locus.contig, mt2.locus.position+50), alleles=mt2.alleles) 
    ...: _, _, loadings1 = hl.hwe_normalized_pca(mt1.GT, k=10, compute_loadings=True) 
    ...: _, _, loadings2 = hl.hwe_normalized_pca(mt2.GT, k=10, compute_loadings=True) 
    ...: union_mt = mt1.union_rows(mt2) 
    ...: union_mt.describe() 
    ...: loadings1.describe() 
    ...: union_mt = union_mt.annotate_entries(projected = loadings1[union_mt.row_key].loadings * union_mt.GT.n_alt_alleles()) 
    ...: union_mt.filter_rows(union_mt.locus.position > 45).show(n_cols=1) 
    ...: union_mt = union_mt.annotate_entries(projected = hl.or_else(union_mt.projected, loadings2[union_mt.row_key].loadings * union_mt.GT.n
    ...: _alt_alleles())) 
    ...: union_mt.filter_rows(union_mt.locus.position > 45).show(n_cols=1) 
    ...:                                                                                                                                     
2019-04-24 10:21:18 Hail: INFO: balding_nichols_model: generating genotypes for 3 populations, 50 samples, and 50 variants...
2019-04-24 10:21:18 Hail: INFO: balding_nichols_model: generating genotypes for 3 populations, 50 samples, and 50 variants...
----------------------------------------
Global fields:
    'bn': struct {
        n_populations: int32, 
        n_samples: int32, 
        n_variants: int32, 
        n_partitions: int32, 
        pop_dist: array<int32>, 
        fst: array<float64>, 
        mixture: bool
    }
----------------------------------------
Column fields:
    'sample_idx': int32
    'pop': int32
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'ancestral_af': float64
    'af': array<float64>
----------------------------------------
Entry fields:
    'GT': call
----------------------------------------
Column key: ['sample_idx']
Row key: ['locus', 'alleles']
----------------------------------------
2019-04-24 10:21:18 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:18 Hail: INFO: hwe_normalized_pca: running PCA using 50 variants.
2019-04-24 10:21:19 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:19 Hail: INFO: pca: running PCA with 10 components...
2019-04-24 10:21:19 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:19 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:19 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:19 Hail: INFO: hwe_normalized_pca: running PCA using 50 variants.
2019-04-24 10:21:20 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:20 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:20 Hail: INFO: pca: running PCA with 10 components...
2019-04-24 10:21:20 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:20 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:21 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:21 Hail: INFO: Coerced sorted dataset
----------------------------------------
Global fields:
    'bn': struct {
        n_populations: int32, 
        n_samples: int32, 
        n_variants: int32, 
        n_partitions: int32, 
        pop_dist: array<int32>, 
        fst: array<float64>, 
        mixture: bool
    }
----------------------------------------
Column fields:
    'sample_idx': int32
    'pop': int32
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'ancestral_af': float64
    'af': array<float64>
----------------------------------------
Entry fields:
    'GT': call
----------------------------------------
Column key: ['sample_idx']
Row key: ['locus', 'alleles']
----------------------------------------
----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'locus': locus<GRCh37> 
    'alleles': array<str> 
    'loadings': array<float64> 
----------------------------------------
Key: ['locus', 'alleles']
----------------------------------------
2019-04-24 10:21:21 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:21 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:21 Hail: INFO: Coerced sorted dataset
+---------------+------------+------+---------------------------------------------------------------------------------------------------+
| locus         | alleles    | 0.GT | 0.projected                                                                                       |
+---------------+------------+------+---------------------------------------------------------------------------------------------------+
| locus<GRCh37> | array<str> | call | array<float64>                                                                                    |
+---------------+------------+------+---------------------------------------------------------------------------------------------------+
| 1:46          | ["A","C"]  | 0/1  | [1.87e-02,-4.67e-02,-2.28e-01,-1.08e-01,-7.43e-02,9.89e-03,2.69e-02,-3.23e-01,1.42e-01,-7.83e-02] |
| 1:47          | ["A","C"]  | 0/1  | [5.09e-02,-8.10e-02,-2.04e-01,4.71e-02,7.70e-03,-8.92e-04,3.42e-03,-1.63e-01,8.47e-02,-2.53e-01]  |
| 1:48          | ["A","C"]  | 0/0  | [0.00e+00,-0.00e+00,0.00e+00,0.00e+00,-0.00e+00,0.00e+00,0.00e+00,-0.00e+00,-0.00e+00,-0.00e+00]  |
| 1:49          | ["A","C"]  | 0/0  | [-0.00e+00,0.00e+00,-0.00e+00,0.00e+00,-0.00e+00,0.00e+00,-0.00e+00,0.00e+00,-0.00e+00,0.00e+00]  |
| 1:50          | ["A","C"]  | 0/0  | [-0.00e+00,-0.00e+00,0.00e+00,0.00e+00,0.00e+00,0.00e+00,0.00e+00,-0.00e+00,0.00e+00,-0.00e+00]   |
| 1:51          | ["A","C"]  | 0/1  | NA                                                                                                |
| 1:52          | ["A","C"]  | 0/1  | NA                                                                                                |
| 1:53          | ["A","C"]  | 1/1  | NA                                                                                                |
| 1:54          | ["A","C"]  | 1/1  | NA                                                                                                |
| 1:55          | ["A","C"]  | 0/1  | NA                                                                                                |
| 1:56          | ["A","C"]  | 1/1  | NA                                                                                                |
| 1:57          | ["A","C"]  | 1/1  | NA                                                                                                |
+---------------+------------+------+---------------------------------------------------------------------------------------------------+
showing top 12 rows
showing the first 1 of 50 columns
2019-04-24 10:21:22 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:23 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:23 Hail: INFO: Coerced sorted dataset
+---------------+------------+------+-----------------------------------------------------------------------------------------------------+
| locus         | alleles    | 0.GT | 0.projected                                                                                         |
+---------------+------------+------+-----------------------------------------------------------------------------------------------------+
| locus<GRCh37> | array<str> | call | array<float64>                                                                                      |
+---------------+------------+------+-----------------------------------------------------------------------------------------------------+
| 1:46          | ["A","C"]  | 0/1  | [1.87e-02,-4.67e-02,-2.28e-01,-1.08e-01,-7.43e-02,9.89e-03,2.69e-02,-3.23e-01,1.42e-01,-7.83e-02]   |
| 1:47          | ["A","C"]  | 0/1  | [5.09e-02,-8.10e-02,-2.04e-01,4.71e-02,7.70e-03,-8.92e-04,3.42e-03,-1.63e-01,8.47e-02,-2.53e-01]    |
| 1:48          | ["A","C"]  | 0/0  | [0.00e+00,-0.00e+00,0.00e+00,0.00e+00,-0.00e+00,0.00e+00,0.00e+00,-0.00e+00,-0.00e+00,-0.00e+00]    |
| 1:49          | ["A","C"]  | 0/0  | [-0.00e+00,0.00e+00,-0.00e+00,0.00e+00,-0.00e+00,0.00e+00,-0.00e+00,0.00e+00,-0.00e+00,0.00e+00]    |
| 1:50          | ["A","C"]  | 0/0  | [-0.00e+00,-0.00e+00,0.00e+00,0.00e+00,0.00e+00,0.00e+00,0.00e+00,-0.00e+00,0.00e+00,-0.00e+00]     |
| 1:51          | ["A","C"]  | 0/1  | [-2.15e-01,-1.44e-01,-9.22e-02,-5.71e-02,1.75e-01,-1.55e-01,3.61e-02,-1.58e-01,-3.50e-01,-1.33e-02] |
| 1:52          | ["A","C"]  | 0/1  | [8.60e-02,1.95e-01,-1.44e-01,1.68e-01,-9.61e-02,8.25e-02,-1.30e-01,2.98e-01,-3.11e-02,7.92e-03]     |
| 1:53          | ["A","C"]  | 1/1  | [-2.80e-01,3.27e-02,-6.19e-02,-4.96e-01,-1.51e-01,-1.11e-01,-1.33e-02,1.94e-01,3.67e-01,1.92e-01]   |
| 1:54          | ["A","C"]  | 1/1  | [4.88e-03,3.18e-02,4.02e-01,-3.91e-02,-6.12e-02,3.79e-01,7.26e-02,6.10e-01,-5.12e-02,5.79e-02]      |
| 1:55          | ["A","C"]  | 0/1  | [-1.22e-01,-1.60e-01,2.32e-01,6.86e-02,-3.77e-02,3.49e-01,-7.54e-02,-1.42e-02,5.03e-02,-8.00e-02]   |
| 1:56          | ["A","C"]  | 1/1  | [3.86e-01,-1.29e-01,3.57e-01,-2.20e-01,7.39e-02,3.87e-01,-1.55e-01,-4.19e-01,3.56e-01,8.60e-02]     |
| 1:57          | ["A","C"]  | 1/1  | [-1.05e-01,-5.41e-01,-2.04e-01,-1.28e-01,-6.17e-02,1.75e-02,-2.59e-02,-3.15e-01,1.03e-01,-9.13e-02] |
+---------------+------------+------+-----------------------------------------------------------------------------------------------------+
showing top 12 rows
showing the first 1 of 50 columns

In [29]:

Topic		Replies	Views
Understanding how hwe_normalized_pca works Hail Query & hailctl	2	407	March 26, 2024
Hl.agg.linreg() output question Hail Query & hailctl	2	392	July 16, 2020
Hl.experimental.pc_project and the alleles key Hail Query & hailctl	5	585	August 17, 2020
Annotating variants in a matrix table with 1000genomes database Hail Query & hailctl	0	342	April 20, 2023
Multiallelic in multisample MatrixTable Hail Query & hailctl	5	341	November 29, 2022

PCA to output allele frequencies alongside loadings?

Related topics