Hi @dmoracze,
If I understand correctly, you have two matrix tables:
- geno, rows are variants, columns are samples, entries are genotypes
- phenos, rows are samples, columns are phenotypes, entries are measurements of said phenotype
I think you want to end up with one matrix table:
- geno, rows are variants, columns are samples, entries are genotypes, plus one column (sample) annotation per phenotype
I created sample data shaped like yours. My phenos file looks like this:
sample_id,weight,height
abc123,150,72
def456,135,60
ghi789,170,66
I think this does what you want, but it’s a bit complicated, because the phenotypes first have to be pivoted from entries into row fields:
In [1]: import hail as hl
...:
...: mt = hl.import_bgen('/tmp/geno.bgen', entry_fields=['GP'])
...: mt.describe()
...:
...: ph = hl.import_matrix_table('/tmp/phenos', delimiter=',', row_fields={'sample_id': hl.tstr}, row_key=['sample_id'])
...: ph.describe()
...:
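...: # collect the phenotype names (the column IDs of ph) into a Python list
...: all_phenotypes = ph.aggregate_cols(hl.agg.collect(ph.col_id))
...: # for each sample (row), gather its entries into a dict of phenotype name -> value
...: ph = ph.annotate_rows(phenos_dict = hl.dict(hl.agg.collect((ph.col_id, ph.x))))
...: # promote each dict entry to its own row field
...: ph = ph.annotate_rows(**{
...:     phenotype: ph.phenos_dict[phenotype]
...:     for phenotype in all_phenotypes
...: })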
...:
...: # annotate phenotypes into the column fields of genos
...: mt = mt.annotate_cols(**ph.rows()[mt.s])
...: mt.cols().show()
...: mt.show()
2021-05-24 16:30:11 Hail: INFO: Number of BGEN files parsed: 1
2021-05-24 16:30:11 Hail: INFO: Number of samples in BGEN files: 3
2021-05-24 16:30:11 Hail: INFO: Number of variants across all BGEN files: 10
----------------------------------------
Global fields:
None
----------------------------------------
Column fields:
's': str
----------------------------------------
Row fields:
'locus': locus<GRCh37>
'alleles': array<str>
'rsid': str
'varid': str
----------------------------------------
Entry fields:
'GP': array<float64>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------
----------------------------------------
Global fields:
None
----------------------------------------
Column fields:
'col_id': str
----------------------------------------
Row fields:
'sample_id': str
----------------------------------------
Entry fields:
'x': int32
----------------------------------------
Column key: ['col_id']
Row key: ['sample_id']
----------------------------------------
2021-05-24 16:30:11 Hail: INFO: Coerced sorted dataset
2021-05-24 16:30:11 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
To preserve matrix table column order, first unkey columns with 'key_cols_by()'
2021-05-24 16:30:11 Hail: INFO: Coerced sorted dataset
+----------+----------------------------+--------+--------+
| s | phenos_dict | height | weight |
+----------+----------------------------+--------+--------+
| str | dict<str, int32> | int32 | int32 |
+----------+----------------------------+--------+--------+
| "abc123" | {"height":72,"weight":150} | 72 | 150 |
| "def456" | {"height":60,"weight":135} | 60 | 135 |
| "ghi789" | {"height":66,"weight":170} | 66 | 170 |
+----------+----------------------------+--------+--------+
+---------------+--------------------------------+------------------------------+------------------------------+------------------------------+
| locus | alleles | 'abc123'.GP | 'def456'.GP | 'ghi789'.GP |
+---------------+--------------------------------+------------------------------+------------------------------+------------------------------+
| locus<GRCh37> | array<str> | array<float64> | array<float64> | array<float64> |
+---------------+--------------------------------+------------------------------+------------------------------+------------------------------+
| 1:10177 | ["A","AC"] | [0.00e+00,1.00e+00,0.00e+00] | [0.00e+00,1.00e+00,0.00e+00] | [0.00e+00,1.00e+00,0.00e+00] |
| 1:10235 | ["T","TA"] | [1.00e+00,0.00e+00,0.00e+00] | [1.00e+00,0.00e+00,0.00e+00] | [1.00e+00,0.00e+00,0.00e+00] |
| 1:10352 | ["T","TA"] | [0.00e+00,1.00e+00,0.00e+00] | [0.00e+00,1.00e+00,0.00e+00] | [0.00e+00,1.00e+00,0.00e+00] |
| 1:10505 | ["A","T"] | [1.00e+00,0.00e+00,0.00e+00] | [1.00e+00,0.00e+00,0.00e+00] | [1.00e+00,0.00e+00,0.00e+00] |
| 1:10506 | ["C","G"] | [1.00e+00,0.00e+00,0.00e+00] | [1.00e+00,0.00e+00,0.00e+00] | [1.00e+00,0.00e+00,0.00e+00] |
| 1:10511 | ["G","A"] | [1.00e+00,0.00e+00,0.00e+00] | [1.00e+00,0.00e+00,0.00e+00] | [1.00e+00,0.00e+00,0.00e+00] |
| 1:10539 | ["C","A"] | [1.00e+00,0.00e+00,0.00e+00] | [1.00e+00,0.00e+00,0.00e+00] | [1.00e+00,0.00e+00,0.00e+00] |
| 1:10542 | ["C","T"] | [1.00e+00,0.00e+00,0.00e+00] | [1.00e+00,0.00e+00,0.00e+00] | [1.00e+00,0.00e+00,0.00e+00] |
| 1:10579 | ["C","A"] | [1.00e+00,0.00e+00,0.00e+00] | [1.00e+00,0.00e+00,0.00e+00] | [1.00e+00,0.00e+00,0.00e+00] |
| 1:10616 | ["CCGCCGTTGCAAAGGCGCGCCG","C"] | [0.00e+00,0.00e+00,1.00e+00] | [0.00e+00,0.00e+00,1.00e+00] | [0.00e+00,0.00e+00,1.00e+00] |
+---------------+--------------------------------+------------------------------+------------------------------+------------------------------+
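If the pivot step above is hard to follow, here is a rough plain-Python model of what it does. This is only a sketch and does not use Hail: the `entries` list is my stand-in for the entries of the phenos matrix table, and the variable names just mirror the ones in the session above.

```python
# Plain-Python model of the pivot; no Hail required.
# Each tuple stands for one matrix-table entry: (sample_id, phenotype name, value).
entries = [
    ("abc123", "weight", 150), ("abc123", "height", 72),
    ("def456", "weight", 135), ("def456", "height", 60),
    ("ghi789", "weight", 170), ("ghi789", "height", 66),
]

# Step 1: collect the distinct phenotype names (the column IDs).
all_phenotypes = sorted({name for _, name, _ in entries})

# Step 2: for each sample, gather its (name, value) pairs into a dict.
phenos_dict = {}
for sample_id, name, value in entries:
    phenos_dict.setdefault(sample_id, {})[name] = value

# Step 3: promote every dict entry to its own named field.
rows = {
    sample_id: {name: d[name] for name in all_phenotypes}
    for sample_id, d in phenos_dict.items()
}

print(rows["abc123"])  # {'height': 72, 'weight': 150}
```

Steps 2 and 3 correspond to the two `annotate_rows` calls: the first builds `phenos_dict`, the second turns each key of that dict into a separate field.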
I think you should import your phenos as a table, not a matrix table. The code for that is much simpler:
In [1]: import hail as hl
...:
...: mt = hl.import_bgen('/tmp/geno.bgen', entry_fields=['GP'])
...: mt.describe()
...:
...: ph = hl.import_table('/tmp/phenos', delimiter=',', impute=True, key='sample_id')
...: ph.describe()
...:
...: # annotate phenotypes into the column fields of genos
...: mt = mt.annotate_cols(**ph[mt.s])
...: mt.cols().show()
...: mt.show()
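One thing to keep in mind with the keyed lookup: as far as I know it behaves like a left join, so a sample with no matching row in phenos is kept and gets missing values for every phenotype. The key also has to be a `str` to match `mt.s`, so it is worth checking `ph.describe()` to see what `impute=True` inferred for `sample_id`. Here is a rough plain-Python model of the lookup, not Hail code. I dropped `ghi789` from the phenotypes on purpose to show the unmatched case, and `None` stands in for a missing value.

```python
# Plain-Python model of a keyed lookup; None stands in for a missing value.
# Hypothetical data: "ghi789" is deliberately absent from the phenotypes.
phenos = {
    "abc123": {"weight": 150, "height": 72},
    "def456": {"weight": 135, "height": 60},
}
fields = ["weight", "height"]
samples = ["abc123", "def456", "ghi789"]

annotated = {}
for s in samples:
    row = phenos.get(s)  # None when the sample has no phenotype row
    annotated[s] = {f: (row[f] if row is not None else None) for f in fields}

# Samples whose phenotypes are all missing had no match in phenos.
unmatched = [s for s, a in annotated.items() if all(v is None for v in a.values())]
print(unmatched)  # ['ghi789']
```

If you expect every sample to have phenotypes, counting the unmatched ones after the annotation is a cheap sanity check.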