Creating struct from key-value pairs

I have a table that contains exon inclusion estimates from RNA-Seq with the columns “event” (exon), “sample” (the individual), “tissue”, and “PSI” (exon inclusion ratio).

I would like to create a matrix table in which the rows are the events, the columns are the samples, and each entry contains a struct where the fields are the tissues and the values are the PSI measurements.

The best I was able to do was mt = ht.to_matrix_table(["EVENT"], ["sample"]), but that gives me the following schema:

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    'sample': str
----------------------------------------
Row fields:
    'EVENT': str
----------------------------------------
Entry fields:
    'PSI': float64
    'tissue': str
----------------------------------------
Column key: ['sample']
Row key: ['EVENT']

How can I convert the entry fields to a struct?

Many thanks!

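to_matrix_table assumes that each (row key, column key) pair maps to exactly one entry, which it shouldn’t (sorry!). In your table every event/sample pair has one row per tissue, so that mapping is not unique. For now, you can work around this by including the tissue in the column key and then grouping the columns back by sample: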

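In [28]: t = hl.utils.range_table(27)
    ...:
    ...: # Toy data: 3 events x 3 samples x 3 tissues, one PSI value per combination.
    ...: t = t.key_by(event='exon_' + hl.str(t.idx // 9), sample='sample_' + hl.str(t.idx // 3 % 3), tissue='tissue_' + hl.str(t.idx % 3), psi=t.idx).drop('idx')
    ...:
    ...: t.show()
    ...:
    ...: # Collect the distinct tissue names locally; they become the entry field names.
    ...: all_tissues = t.aggregate(hl.agg.collect_as_set(t.tissue))
    ...:
    ...: # Put tissue in the column key so each (row, column) pair maps to exactly one entry.
    ...: mt = t.to_matrix_table(['event'], ['sample', 'tissue'])
    ...: # Merge the per-tissue columns of each sample into a tissue -> PSI dict.
    ...: mt = mt.group_cols_by(mt.sample).aggregate(tissue_to_psi_dict = hl.dict(hl.agg.collect((mt.tissue, mt.psi))))
    ...: # Expand the dict into one entry field per tissue.
    ...: mt = mt.select_entries(**{
    ...:     tissue_name: mt.tissue_to_psi_dict[tissue_name] for tissue_name in all_tissues
    ...: })
    ...: mt.show()
    ...: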
+----------+------------+------------+-------+
| event    | sample     | tissue     |   psi |
+----------+------------+------------+-------+
| str      | str        | str        | int32 |
+----------+------------+------------+-------+
| "exon_0" | "sample_0" | "tissue_0" |     0 |
| "exon_0" | "sample_0" | "tissue_1" |     1 |
| "exon_0" | "sample_0" | "tissue_2" |     2 |
| "exon_0" | "sample_1" | "tissue_0" |     3 |
| "exon_0" | "sample_1" | "tissue_1" |     4 |
| "exon_0" | "sample_1" | "tissue_2" |     5 |
| "exon_0" | "sample_2" | "tissue_0" |     6 |
| "exon_0" | "sample_2" | "tissue_1" |     7 |
| "exon_0" | "sample_2" | "tissue_2" |     8 |
| "exon_1" | "sample_0" | "tissue_0" |     9 |
| "exon_1" | "sample_0" | "tissue_1" |    10 |
| "exon_1" | "sample_0" | "tissue_2" |    11 |
| "exon_1" | "sample_1" | "tissue_0" |    12 |
| "exon_1" | "sample_1" | "tissue_1" |    13 |
| "exon_1" | "sample_1" | "tissue_2" |    14 |
| "exon_1" | "sample_2" | "tissue_0" |    15 |
| "exon_1" | "sample_2" | "tissue_1" |    16 |
| "exon_1" | "sample_2" | "tissue_2" |    17 |
| "exon_2" | "sample_0" | "tissue_0" |    18 |
| "exon_2" | "sample_0" | "tissue_1" |    19 |
| "exon_2" | "sample_0" | "tissue_2" |    20 |
| "exon_2" | "sample_1" | "tissue_0" |    21 |
| "exon_2" | "sample_1" | "tissue_1" |    22 |
| "exon_2" | "sample_1" | "tissue_2" |    23 |
| "exon_2" | "sample_2" | "tissue_0" |    24 |
| "exon_2" | "sample_2" | "tissue_1" |    25 |
| "exon_2" | "sample_2" | "tissue_2" |    26 |
+----------+------------+------------+-------+
2021-05-19 18:06:46 Hail: INFO: Coerced sorted dataset
2021-05-19 18:06:46 Hail: INFO: Coerced dataset with out-of-order partitions.
+----------+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+
| event    | 'sample_0'.tissue_2 | 'sample_0'.tissue_1 | 'sample_0'.tissue_0 | 'sample_1'.tissue_2 | 'sample_1'.tissue_1 | 'sample_1'.tissue_0 | 'sample_2'.tissue_2 | 'sample_2'.tissue_1 |
+----------+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+
| str      |               int32 |               int32 |               int32 |               int32 |               int32 |               int32 |               int32 |               int32 |
+----------+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+
| "exon_0" |                   2 |                   1 |                   0 |                   5 |                   4 |                   3 |                   8 |                   7 |
| "exon_1" |                  11 |                  10 |                   9 |                  14 |                  13 |                  12 |                  17 |                  16 |
| "exon_2" |                  20 |                  19 |                  18 |                  23 |                  22 |                  21 |                  26 |                  25 |
+----------+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+

+---------------------+
| 'sample_2'.tissue_0 |
+---------------------+
|               int32 |
+---------------------+
|                   6 |
|                  15 |
|                  24 |
+---------------------+
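
To make the target shape concrete, here is a plain-Python sketch (no Hail involved) of the same reshaping on the toy data above. The function name pivot_to_structs and the nested-dict layout are made up for illustration; it only mirrors what the group_cols_by / select_entries steps do conceptually, not how Hail implements them.

```python
from collections import defaultdict

# Long-format records (event, sample, tissue, psi), mirroring the toy table above.
records = [
    (f"exon_{i // 9}", f"sample_{i // 3 % 3}", f"tissue_{i % 3}", i)
    for i in range(27)
]


def pivot_to_structs(records):
    """Group long-format rows into {event: {sample: {tissue: psi}}}."""
    # Every entry gets the same set of fields, one per tissue seen anywhere.
    tissues = sorted({tissue for _, _, tissue, _ in records})

    grouped = defaultdict(lambda: defaultdict(dict))
    for event, sample, tissue, psi in records:
        grouped[event][sample][tissue] = psi

    # Tissues that were not measured for an event/sample pair become None.
    return {
        event: {
            sample: {t: by_tissue.get(t) for t in tissues}
            for sample, by_tissue in samples.items()
        }
        for event, samples in grouped.items()
    }


matrix = pivot_to_structs(records)
print(matrix["exon_1"]["sample_2"])
# {'tissue_0': 15, 'tissue_1': 16, 'tissue_2': 17}
```

Note that the sketch fills in None when a tissue was not measured for a given sample. The toy data is complete, so this never comes up there, but with real data it may be worth checking how the dict lookup in select_entries behaves for absent tissues; as far as I know, using .get(tissue_name) on the Hail dict gives a missing value in that case.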
