If you have two separate tables with the same sample ids but disjoint variants, something like the following might be what you need.
I generate two random datasets, mt1
and mt2
. This model will generate the same loci for each dataset, so I shift the second dataset’s locus positions such that the datasets’ loci are disjoint. I run PCA separately for each dataset (this might not be well founded from a statistical point of view, but it demonstrates having two separate sources of loadings). This produces two tables of loadings, loadings1
and loadings2
. I annotate the first loadings first (nb: in reality you should mean center and variance normalize the GTs first, see pc_project line 132). As you can see from the show
the projected genotypes from the first dataset have real values and the projected genotypes from the second dataset (position>50) have NA
. Finally, I project the remaining genotypes using loadings2
.
Upon a second reading, I think I have not addressed your question directly! If you can clarify a bit I can give a better example .
In [28]: import hail as hl
...:
...: mt1 = hl.balding_nichols_model(3, 50, 50)
...: mt2 = hl.balding_nichols_model(3, 50, 50)
...: mt1.describe()
...: mt2 = mt2.key_rows_by(locus=hl.locus(mt2.locus.contig, mt2.locus.position+50), alleles=mt2.alleles)
...: _, _, loadings1 = hl.hwe_normalized_pca(mt1.GT, k=10, compute_loadings=True)
...: _, _, loadings2 = hl.hwe_normalized_pca(mt2.GT, k=10, compute_loadings=True)
...: union_mt = mt1.union_rows(mt2)
...: union_mt.describe()
...: loadings1.describe()
...: union_mt = union_mt.annotate_entries(projected = loadings1[union_mt.row_key].loadings * union_mt.GT.n_alt_alleles())
...: union_mt.filter_rows(union_mt.locus.position > 45).show(n_cols=1)
...: union_mt = union_mt.annotate_entries(projected = hl.or_else(union_mt.projected, loadings2[union_mt.row_key].loadings * union_mt.GT.n
...: _alt_alleles()))
...: union_mt.filter_rows(union_mt.locus.position > 45).show(n_cols=1)
...:
2019-04-24 10:21:18 Hail: INFO: balding_nichols_model: generating genotypes for 3 populations, 50 samples, and 50 variants...
2019-04-24 10:21:18 Hail: INFO: balding_nichols_model: generating genotypes for 3 populations, 50 samples, and 50 variants...
----------------------------------------
Global fields:
'bn': struct {
n_populations: int32,
n_samples: int32,
n_variants: int32,
n_partitions: int32,
pop_dist: array<int32>,
fst: array<float64>,
mixture: bool
}
----------------------------------------
Column fields:
'sample_idx': int32
'pop': int32
----------------------------------------
Row fields:
'locus': locus<GRCh37>
'alleles': array<str>
'ancestral_af': float64
'af': array<float64>
----------------------------------------
Entry fields:
'GT': call
----------------------------------------
Column key: ['sample_idx']
Row key: ['locus', 'alleles']
----------------------------------------
2019-04-24 10:21:18 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:18 Hail: INFO: hwe_normalized_pca: running PCA using 50 variants.
2019-04-24 10:21:19 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:19 Hail: INFO: pca: running PCA with 10 components...
2019-04-24 10:21:19 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:19 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:19 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:19 Hail: INFO: hwe_normalized_pca: running PCA using 50 variants.
2019-04-24 10:21:20 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:20 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:20 Hail: INFO: pca: running PCA with 10 components...
2019-04-24 10:21:20 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:20 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:21 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:21 Hail: INFO: Coerced sorted dataset
----------------------------------------
Global fields:
'bn': struct {
n_populations: int32,
n_samples: int32,
n_variants: int32,
n_partitions: int32,
pop_dist: array<int32>,
fst: array<float64>,
mixture: bool
}
----------------------------------------
Column fields:
'sample_idx': int32
'pop': int32
----------------------------------------
Row fields:
'locus': locus<GRCh37>
'alleles': array<str>
'ancestral_af': float64
'af': array<float64>
----------------------------------------
Entry fields:
'GT': call
----------------------------------------
Column key: ['sample_idx']
Row key: ['locus', 'alleles']
----------------------------------------
----------------------------------------
Global fields:
None
----------------------------------------
Row fields:
'locus': locus<GRCh37>
'alleles': array<str>
'loadings': array<float64>
----------------------------------------
Key: ['locus', 'alleles']
----------------------------------------
2019-04-24 10:21:21 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:21 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:21 Hail: INFO: Coerced sorted dataset
+---------------+------------+------+---------------------------------------------------------------------------------------------------+
| locus | alleles | 0.GT | 0.projected |
+---------------+------------+------+---------------------------------------------------------------------------------------------------+
| locus<GRCh37> | array<str> | call | array<float64> |
+---------------+------------+------+---------------------------------------------------------------------------------------------------+
| 1:46 | ["A","C"] | 0/1 | [1.87e-02,-4.67e-02,-2.28e-01,-1.08e-01,-7.43e-02,9.89e-03,2.69e-02,-3.23e-01,1.42e-01,-7.83e-02] |
| 1:47 | ["A","C"] | 0/1 | [5.09e-02,-8.10e-02,-2.04e-01,4.71e-02,7.70e-03,-8.92e-04,3.42e-03,-1.63e-01,8.47e-02,-2.53e-01] |
| 1:48 | ["A","C"] | 0/0 | [0.00e+00,-0.00e+00,0.00e+00,0.00e+00,-0.00e+00,0.00e+00,0.00e+00,-0.00e+00,-0.00e+00,-0.00e+00] |
| 1:49 | ["A","C"] | 0/0 | [-0.00e+00,0.00e+00,-0.00e+00,0.00e+00,-0.00e+00,0.00e+00,-0.00e+00,0.00e+00,-0.00e+00,0.00e+00] |
| 1:50 | ["A","C"] | 0/0 | [-0.00e+00,-0.00e+00,0.00e+00,0.00e+00,0.00e+00,0.00e+00,0.00e+00,-0.00e+00,0.00e+00,-0.00e+00] |
| 1:51 | ["A","C"] | 0/1 | NA |
| 1:52 | ["A","C"] | 0/1 | NA |
| 1:53 | ["A","C"] | 1/1 | NA |
| 1:54 | ["A","C"] | 1/1 | NA |
| 1:55 | ["A","C"] | 0/1 | NA |
| 1:56 | ["A","C"] | 1/1 | NA |
| 1:57 | ["A","C"] | 1/1 | NA |
+---------------+------------+------+---------------------------------------------------------------------------------------------------+
showing top 12 rows
showing the first 1 of 50 columns
2019-04-24 10:21:22 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:23 Hail: INFO: Coerced sorted dataset
2019-04-24 10:21:23 Hail: INFO: Coerced sorted dataset
+---------------+------------+------+-----------------------------------------------------------------------------------------------------+
| locus | alleles | 0.GT | 0.projected |
+---------------+------------+------+-----------------------------------------------------------------------------------------------------+
| locus<GRCh37> | array<str> | call | array<float64> |
+---------------+------------+------+-----------------------------------------------------------------------------------------------------+
| 1:46 | ["A","C"] | 0/1 | [1.87e-02,-4.67e-02,-2.28e-01,-1.08e-01,-7.43e-02,9.89e-03,2.69e-02,-3.23e-01,1.42e-01,-7.83e-02] |
| 1:47 | ["A","C"] | 0/1 | [5.09e-02,-8.10e-02,-2.04e-01,4.71e-02,7.70e-03,-8.92e-04,3.42e-03,-1.63e-01,8.47e-02,-2.53e-01] |
| 1:48 | ["A","C"] | 0/0 | [0.00e+00,-0.00e+00,0.00e+00,0.00e+00,-0.00e+00,0.00e+00,0.00e+00,-0.00e+00,-0.00e+00,-0.00e+00] |
| 1:49 | ["A","C"] | 0/0 | [-0.00e+00,0.00e+00,-0.00e+00,0.00e+00,-0.00e+00,0.00e+00,-0.00e+00,0.00e+00,-0.00e+00,0.00e+00] |
| 1:50 | ["A","C"] | 0/0 | [-0.00e+00,-0.00e+00,0.00e+00,0.00e+00,0.00e+00,0.00e+00,0.00e+00,-0.00e+00,0.00e+00,-0.00e+00] |
| 1:51 | ["A","C"] | 0/1 | [-2.15e-01,-1.44e-01,-9.22e-02,-5.71e-02,1.75e-01,-1.55e-01,3.61e-02,-1.58e-01,-3.50e-01,-1.33e-02] |
| 1:52 | ["A","C"] | 0/1 | [8.60e-02,1.95e-01,-1.44e-01,1.68e-01,-9.61e-02,8.25e-02,-1.30e-01,2.98e-01,-3.11e-02,7.92e-03] |
| 1:53 | ["A","C"] | 1/1 | [-2.80e-01,3.27e-02,-6.19e-02,-4.96e-01,-1.51e-01,-1.11e-01,-1.33e-02,1.94e-01,3.67e-01,1.92e-01] |
| 1:54 | ["A","C"] | 1/1 | [4.88e-03,3.18e-02,4.02e-01,-3.91e-02,-6.12e-02,3.79e-01,7.26e-02,6.10e-01,-5.12e-02,5.79e-02] |
| 1:55 | ["A","C"] | 0/1 | [-1.22e-01,-1.60e-01,2.32e-01,6.86e-02,-3.77e-02,3.49e-01,-7.54e-02,-1.42e-02,5.03e-02,-8.00e-02] |
| 1:56 | ["A","C"] | 1/1 | [3.86e-01,-1.29e-01,3.57e-01,-2.20e-01,7.39e-02,3.87e-01,-1.55e-01,-4.19e-01,3.56e-01,8.60e-02] |
| 1:57 | ["A","C"] | 1/1 | [-1.05e-01,-5.41e-01,-2.04e-01,-1.28e-01,-6.17e-02,1.75e-02,-2.59e-02,-3.15e-01,1.03e-01,-9.13e-02] |
+---------------+------------+------+-----------------------------------------------------------------------------------------------------+
showing top 12 rows
showing the first 1 of 50 columns
In [29]: