Difficulty joining ht to mt

I’m probably being really stupid but I can’t seem to join a HT to a MT.

I have a MT that has a row called “geneIds” amongst many others.

Row fields: 'geneIds': set<str>

I’ve created a tsv file, which I’ve read in as HT that has geneIds and some other metrics:

For the HT:
key: [‘geneIds’]

For the MT:
Column key: [‘s’]
Row key: [‘locus’, ‘alleles’]

Essentially I want to join the HT to the MT so I can filter genotypes on LOEUF score or inheritance class for example. I figure I need to change the column key for the MT to match that of the HT, but the following code fails:


From playing around with the tutorial, something like this should work:
mt.annotate_rows(GenCC=genes[mt.geneIds].GenCC) but it fails, and I assume that’s because the keys don’t marry up.

Any ideas where I am going wrong?! I’ve tried changing the row_keys of the mt to be geneIds which at least changes the row key but that still doesn’t work and I’ve tried annotate_rows and annotate_cols.

Apologies for the basic Q!

Hi @lecb!

Can you share the error messages and full stack traces? It’s hard to give suggestions without knowing exactly what went wrong.

You appear to have a matrix table with variants as rows. Each row is associated with zero or more genes, represented by the row field geneIds. You also have a table keyed by a field called geneIds but which is not a set, it is just a single string. In particular, every row in your table has exactly one gene. Here’s how to bring your data over:

# it is important for speed that we read and then immediately get the rows
mt = hl.read_matrix_table(...)
rows = mt.rows()
rows = rows.select('geneIds')
rows = rows.explode('geneIds')
# we now have one row per variant per gene
rows = rows.select(geneId=rows.geneIds)
rows = rows.annotate(GenCC = genes[rows.geneId])

mt = mt.annotate_rows(
    GenCC = rows.index(mt.row_key, all_matches=True).GenCC

That last step uses Table.index to get every matching record. In particular, your matrix table will have a new field, GenCC, which is an array containing one entry for each gene of which your variant is a member.

1 Like

Amazing, thanks so much! You have to reduce the mt to a 2d table to then join the additional columns before turning in “3D” again. I do have one further Q though if I may for my own understanding…

Why does rows.GenCC.show() look like this:

and `mt.GenCC.show(5) look like this:

Is it because in the mt there is potentially more than one gene per variant? The problem I now have is filtering by inheritance class. I would usually have used tmp=mt.filter_rows(mt.GenCC.inheritance_class == 'recessive')

But if now fails because the structure is more complicated :-(. I assume there is a way to filter for ‘recessive’ genes if any one the genes annotated by the variant is recessive?

For anyone interested, I figured it out!

tmp=mt.filter_rows((mt.GenCC[0].inheritance_class == 'recessive') | (mt.GenCC[1].inheritance_class == 'recessive'))

Yeah, it’s a bit funky when you look at an array of structured records. The missing data doesn’t help. When the entire GenCC is missing (e.g. 138817) that means your geneIds was an empty set. If you’ve an array of missing data that means there’s one or more genes at that variant but there was no information in the genes table.

I think you might find it easier to work with the data if you duplicate your rows one for each involved gene. You don’t want to write out such a matrix table (you duplicate the big genotype data unnecessarily) but you can use it to calculate some statistics.

mt = mt.explode_rows(mt.GenCC)

EDIT: Specifically, the reason for the difference is that the table has one row per-gene-per-variant whereas the matrix table has. one row per-variant. As a result, we have to pack all the per-gene information into a variable length array.

1 Like