Making MatrixTable into a Table

I have a matrix table (MT) with multiple row fields (userID, A, B, C) and column fields (X, Y, Z) but only one entry field. I’d like to change this matrix table into a regular table with userID as the row name and columns (A, B, C, X, Y, Z). Is there a way to do this? Thanks!

Hey @Mark,

You’re looking for make_table

Thanks.

Is there a certain size at which a MatrixTable becomes too large for this function to work effectively? I tried using it on a very large table, and I got an error saying, “Maximum recursion depth exceeded.”

This will definitely have an upper bound of effectiveness. What are your downstream applications for such a table? If it’s possible to write these applications to work against the matrix table, they’re going to be much more efficient.

Okay, I will try that. I’m basically trying to run a PheWAS. I have one MatrixTable (mtA) with one subject per row. Each column in this MatrixTable is a phenotype with a number of attributes. I have another MatrixTable (mtB) with one variant per row, and one subject per column. I’m trying to combine these two MatrixTables so that I can run regressions. I’m struggling because the subjects are in the rows of mtA and the columns of mtB. Is there a way I can combine these two MatrixTables into one? Then, I can filter it as needed and run regressions.

Basically, what I’m trying to accomplish now, I think, is to annotate the rows of one MatrixTable with entries (indexed by column) from another.

Hey @Mark!

Great question. Hail currently makes this harder than it should be. The best we can do currently is to localize your phenotypes as a column field dictionary.

You can run regressions using mtB.phenos.values() as a vector of independent variables.

import hail as hl

mtB = hl.balding_nichols_model(1, 5, 5)
mtA = mtB.cols()
mtA = mtA.annotate(ldl=10, hdl=12, height=60)
mtA = mtA.to_matrix_table_row_major(columns=['ldl', 'hdl', 'height'], entry_field_name='value', col_field_name='pheno')

mtB.show()
mtA.show()

# This doesn't work in Hail currently:
# mtA = mtA.annotate_cols(
#     phenotype_vector = hl.agg.collect(mtB[mtA.col_key, :].phenotype))
#
# Instead:
mtA = mtA.annotate_rows(phenos = hl.dict(hl.agg.collect((mtA.pheno, mtA.value))))
mtB = mtB.annotate_cols(
    phenos = mtA.rows()[mtB.col_key].phenos)

mtB.cols().show()

There appears to be a bug in show of a dict, but the data is fine:

mtB:

+---------------+------------+------+------+------+------+------+
| locus         | alleles    | 0.GT | 1.GT | 2.GT | 3.GT | 4.GT |
+---------------+------------+------+------+------+------+------+
| locus<GRCh37> | array<str> | call | call | call | call | call |
+---------------+------------+------+------+------+------+------+
| 1:1           | ["A","C"]  | 0/1  | 0/0  | 0/0  | 0/0  | 0/1  |
| 1:2           | ["A","C"]  | 0/0  | 0/1  | 1/1  | 1/1  | 1/1  |
| 1:3           | ["A","C"]  | 0/0  | 0/1  | 1/1  | 0/1  | 0/1  |
| 1:4           | ["A","C"]  | 1/1  | 0/1  | 1/1  | 1/1  | 1/1  |
| 1:5           | ["A","C"]  | 1/1  | 1/1  | 0/1  | 0/1  | 1/1  |
+---------------+------------+------+------+------+------+------+

mtA:

+------------+-------------+-------------+----------------+
| sample_idx | 'ldl'.value | 'hdl'.value | 'height'.value |
+------------+-------------+-------------+----------------+
|      int32 |       int32 |       int32 |          int32 |
+------------+-------------+-------------+----------------+
|          0 |          10 |          12 |             60 |
|          1 |          10 |          12 |             60 |
|          2 |          10 |          12 |             60 |
|          3 |          10 |          12 |             60 |
|          4 |          10 |          12 |             60 |
+------------+-------------+-------------+----------------+

mtB.cols():

+------------+-------+------------------------------------+
| sample_idx |   pop | phenos                             |
+------------+-------+------------------------------------+
|      int32 | int32 | dict<str, int32>                   |
+------------+-------+------------------------------------+
|          0 |     0 | {"hdl":12},"height":60},"ldl":10}} |
|          1 |     0 | {"hdl":12},"height":60},"ldl":10}} |
|          2 |     0 | {"hdl":12},"height":60},"ldl":10}} |
|          3 |     0 | {"hdl":12},"height":60},"ldl":10}} |
|          4 |     0 | {"hdl":12},"height":60},"ldl":10}} |
+------------+-------+------------------------------------+

Thanks. This isn’t working for me, either. My phenotypes are labeled with multiple parameters, not just one name ‘pheno’, and I have three entry fields.

Here’s MatrixTable MT:

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    'trait_type': str
    'phenocode': str
    'pheno_sex': str
    'coding': str
    'modifier': str
    'n_cases_both_sexes': int64
    'n_cases_females': int64
    'n_cases_males': int64
    'description': str
    'description_more': str
    'coding_description': str
    'category': str
----------------------------------------
Row fields:
    'userId': int32
    'PC1': float64
    'PC2': float64
    'PC3': float64
    'PC4': float64
    'PC5': float64
    'PC6': float64
    'PC7': float64
    'PC8': float64
    'PC9': float64
    'PC10': float64
    'PC11': float64
    'PC12': float64
    'PC13': float64
    'PC14': float64
    'PC15': float64
    'PC16': float64
    'PC17': float64
    'PC18': float64
    'PC19': float64
    'PC20': float64
    'pop': str
    'related': bool
    'age': int32
    'sex': int32
    'age_sex': int32
    'age2': int32
    'age2_sex': int32
----------------------------------------
Entry fields:
    'both_sexes': float64
    'females': float64
    'males': float64
----------------------------------------
Column key: ['trait_type', 'phenocode', 'pheno_sex', 'coding', 'modifier']
Row key: ['userId']
----------------------------------------

I also have MatrixTable bgenmt:

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'rsid': str
    'varid': str
----------------------------------------
Entry fields:
    'GP': array<float64>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

userId in MT is the same as s in bgenmt. Any help combining these two objects would be appreciated.

userId is an integer and s is a string, so you’ll need to convert one of them to the other type. You can use hl.str or hl.int for this.
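
A small plain-Python illustration of why the key types have to agree (the dictionary and variable names here are invented for the example): a keyed join is a lookup, and the integer 1001 never matches the string "1001" until one side is converted.

```python
# Phenotype rows keyed by an integer userId, as in MT.
phenos_by_user = {1001: {"both_sexes": 1.0}, 1002: {"both_sexes": 0.0}}

# Sample IDs from the BGEN file are strings, as in bgenmt.s.
sample_ids = ["1001", "1002", "1003"]

# Looking up the raw strings finds nothing, because "1001" != 1001.
unconverted = [phenos_by_user.get(s) for s in sample_ids]

# Converting the key first (the role hl.int plays in Hail) makes the lookup succeed.
converted = [phenos_by_user.get(int(s)) for s in sample_ids]

print(unconverted)
print(converted)
```

Sample "1003" still has no match after conversion, so it is worth checking how many keys actually join once the types agree.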

You can use as many fields as you’d like in the key of the dictionary and in the value of the dictionary:

mtA = mtA.annotate_rows(phenos = hl.dict(hl.agg.collect(
    (hl.struct(trait_type=mtA.trait_type,  # the first struct is the key
               phenocode=mtA.phenocode,
               pheno_sex=mtA.pheno_sex, ...),
     hl.struct(both_sexes=mtA.both_sexes,  # the second struct is the value
               females=mtA.females,
               males=mtA.males)))))

You can run regressions on mtB.phenos.values().both_sexes, and likewise on .females and .males. In Hail, if you have an array of structs, you can pick out any one field:

arr = hl.array([hl.struct(height=60, weight=150),
                hl.struct(height=50, weight=120)])
just_heights = arr.height  # equal to hl.array([60, 50])
just_weights = arr.weight  # equal to hl.array([150, 120])

Thanks. Another question.

I added a column field ‘dosage’ to the MatrixTable MT above. I would like to ‘annotate_cols’ with the sum of ‘dosage’ for each phenotype removing rows for which the entry ‘both_sexes’ is missing.

In other words, I would like to calculate the total dosage for each phenotype only including people for whom the phenotype is not missing.

How should I write this? I’m having trouble. Thanks!

I tried something like this:

mt = mt.annotate_cols(n_homozygotes = hl.agg.sum(mt.dosage))

But this produces the same value for all phenotypes and does not remove individuals missing any given phenotype.

Aha. Maybe this will work.

mt = mt.annotate_cols(n_homozygotes = hl.agg.filter(hl.is_defined(mt.both_sexes), hl.agg.sum(mt.dosage)))
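
For anyone checking the logic, this is what the filtered aggregation computes, written as a plain-Python model on toy data. None stands in for a missing entry, and the values and names are invented. It assumes dosage holds one value per person, i.e. per row of MT.

```python
# Toy data: one dosage per person (row), one entry per person and phenotype.
dosage = {"u1": 2.0, "u2": 1.0, "u3": 1.0}
both_sexes = {
    "pheno_a": {"u1": 1.0, "u2": None, "u3": 0.0},   # u2 is missing
    "pheno_b": {"u1": None, "u2": None, "u3": 1.0},  # u1 and u2 are missing
}

def filtered_dosage_sum(entries):
    """Sum dosage over people whose entry for this phenotype is defined."""
    return sum(dosage[user] for user, value in entries.items() if value is not None)

# Unfiltered sum: identical for every phenotype, as seen in the first attempt.
unfiltered = {pheno: sum(dosage.values()) for pheno in both_sexes}

# Filtered sum: differs per phenotype, which is the role of hl.agg.filter.
filtered = {pheno: filtered_dosage_sum(entries) for pheno, entries in both_sexes.items()}

print(unfiltered)
print(filtered)
```

Comparing the two dictionaries on a small slice of real data is a quick way to confirm the filter is being applied per phenotype.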