Variant annotation in MatrixTable

Hi,
I have a table (ht); here is the output of ht.describe():

----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'chr': str
    'start': int32
    'end': int32
    'range': str
    'window_id': str
    'gene_name': str
    'n_variants': int32
    'gene_id': str
    'interval': interval<locus<GRCh38>>
----------------------------------------
Key: ['interval']

I want to annotate the variants in my mt and create a struct row field that holds all of this information. One variant can overlap multiple intervals, so I probably need to create an array<struct{...}>.
I am not sure how to do that. Could anybody help me, please?

The familiar syntax below is almost what you want:

mt = mt.annotate_rows(interval_data = ht[mt.row_key])

However, this will give you at most one struct per row of the matrix table, even when a variant overlaps several intervals. Instead, you can call the Table.index method directly and use the all_matches flag:

mt = mt.annotate_rows(interval_data = ht.index(mt.row_key, all_matches=True))
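To make the difference concrete, here is a small pure-Python model of the two lookups. It is only a sketch of the semantics and not Hail code: the IntervalRow class and the single_match / all_matches helpers are made up for illustration, positions are plain integers on one contig, and which row a single-match lookup returns when several intervals overlap should not be relied on.

```python
from typing import List, NamedTuple, Optional


class IntervalRow(NamedTuple):
    """Hypothetical stand-in for one row of the interval table."""
    start: int       # treated as inclusive in this model
    end: int         # treated as exclusive in this model
    gene_name: str


def contains(row: IntervalRow, pos: int) -> bool:
    """True if pos falls inside the row's interval."""
    return row.start <= pos < row.end


def single_match(rows: List[IntervalRow], pos: int) -> Optional[IntervalRow]:
    """Model of a plain lookup: at most one row comes back per position."""
    for row in rows:
        if contains(row, pos):
            return row
    return None


def all_matches(rows: List[IntervalRow], pos: int) -> List[IntervalRow]:
    """Model of all_matches=True: every overlapping row, as a list."""
    return [row for row in rows if contains(row, pos)]


rows = [IntervalRow(100, 500, "GeneA"), IntervalRow(300, 900, "GeneB")]

# Position 400 overlaps both intervals.
one = single_match(rows, 400)
many = all_matches(rows, 400)
print(one.gene_name)                    # a single gene
print([r.gene_name for r in many])      # both genes
```

The list returned by all_matches plays the role of the array<struct{...}> field on the matrix table row.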

Thank you very much @tpoterba. It was exactly what I wanted.

I am having another issue now -

my ht looks like -

I have a variant, chr1:948594, in my mt which surprisingly was not annotated with the ht information, although the position is present in the ht. See the last line: the end position is the actual variant position.

Here is my code -

mt = mt.annotate_rows(windows = ht.index(mt.row_key[0], all_matches=True))
rows_ht = mt.rows()
rows_ht.select(rows_ht.windows).show()

Which gives me the output like -

As you can see the last variant was not correctly annotated.

The printed representation of these intervals uses interval notation, which shows that they include the left endpoint but exclude the right endpoint. You’ll need to re-create these intervals so that they include the right endpoint if you want chr1:946174-chr1:948594 to contain chr1:948594.
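Here is a minimal pure-Python sketch of why the variant was missed and what re-creating the interval changes. The Interval class below is a hypothetical model written for this post, not Hail's own type. In Hail, hl.interval and hl.locus_interval accept includes_start and includes_end arguments, so rebuilding the key with includes_end=True would be the analogous step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Hypothetical model of an interval with configurable endpoints."""
    start: int
    end: int
    includes_start: bool = True
    includes_end: bool = False   # half-open by default, as in the printed form

    def contains(self, pos: int) -> bool:
        after_start = pos > self.start or (self.includes_start and pos == self.start)
        before_end = pos < self.end or (self.includes_end and pos == self.end)
        return after_start and before_end

    def __str__(self) -> str:
        left = "[" if self.includes_start else "("
        right = "]" if self.includes_end else ")"
        return f"{left}{self.start}-{self.end}{right}"


variant_pos = 948594

# The interval as originally built: the right endpoint is excluded.
half_open = Interval(946174, 948594)

# Re-created from the same endpoints, now including the right endpoint.
closed = Interval(half_open.start, half_open.end,
                  includes_start=True, includes_end=True)

print(half_open, half_open.contains(variant_pos))  # [946174-948594) False
print(closed, closed.contains(variant_pos))        # [946174-948594] True
```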

Ahhh Okie. Thanks Tim.

Hi,
I have a question regarding Table.index

I have a Hail table (ht)
ht.describe()

----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'CHR': int32 
    'START': int32 
    'END': int32 
    'Window_boundary': str 
    'Window_Name': str 
    'Gene_ID': str 
    'No_of_Variants': int32 
    'Range': str 
    'interval': interval<locus<GRCh38>> 
----------------------------------------
Key: ['interval', 'Gene_ID']
----------------------------------------

Now I need to annotate my mt
The key of my mt is -

----------------------------------------
Column key: ['s']
Row key: ['locus', 'gene_set']
----------------------------------------

As @tpoterba mentioned earlier in this thread, I am trying this -

mt = mt.annotate_rows(interval_data = ht.index(mt.row_key, all_matches=True))

But it gives me the error shown below -

Error:

ExpressionException: Key type mismatch: cannot index table with given expressions:
  Table key:         interval<locus<GRCh38>>, str
  Index Expressions: locus<GRCh38>, str

However,

mt = mt.annotate_rows(interval_data = ht.index(mt.row_key[0], all_matches=True))

works. But I need both conditions to hold: (i) the variant (mt.locus) must fall within the interval (ht.interval), and (ii) mt.gene_set must match ht.Gene_ID.
Note that my mt already has some duplicated mt.locus values with different mt.gene_set values, but the key is unique.

Could you please help me? Thanks.

I think the solution here is to take gene_set out of the table key.

But then how do I match both conditions, locus and gene_set?

Also, mt.locus is repeated: the same locus appears for multiple genes!

I see. Interval joins are a bit special: they are only possible when joining an interval key to a point of the matching type (here, interval<locus<GRCh38>> to locus<GRCh38>), and they don’t work when you have additional key fields after the interval.

I think what I’d do is (1) key the interval table by only the interval, and (2) filter the result afterwards:

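# Re-key by the interval alone; Gene_ID becomes an ordinary row field
interval_ht = ht.key_by('interval')
mt = mt.annotate_rows(interval_matches =
  interval_ht.index(mt.locus, all_matches=True).filter(
    lambda x: x.Gene_ID == mt.gene_set))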

I am not sure this will give me what I want.
Let’s take an example -
chr1:1000 is a LoF variant for GeneA and GeneB, and chr1:1100 is a LoF variant for GeneC

The interval file says -

chr1:100-2000 GeneA rank1
chr1:100-2000 GeneA rank2
chr1:100-2000 GeneB rank1
chr1:100-2000 GeneC rank1

I want to annotate my MatrixTable in a way -

chr1:1000 [GeneA, GeneB]  ## this we get from all_matches=True
chr1:1100 [GeneC]

But with what you proposed, wouldn’t I get

chr1:1000 [GeneA, GeneB, GeneC]  
chr1:1100 [GeneA, GeneB, GeneC]

Eventually, I want to explode on the gene, and then chr1:1000 would be annotated with GeneC, and chr1:1100 would be annotated with GeneA and GeneB, which are both wrong.
Sorry for making this so complicated.

Ahh wait -
So what you proposed, @tpoterba, actually filters the genes, right?
So with the example above,
it gives -

chr1:1000 [GeneA, GeneB]  ## this we get from all_matches=True
chr1:1100 [GeneC]

Right ??

yep, exactly.
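For anyone following along, the example above can be checked with a small pure-Python model of "find all overlapping intervals, then keep only those whose gene matches the row's gene_set". This is a sketch, not Hail code: the IntervalRow and VariantRow classes and the annotate helper are hypothetical names, positions are integers on chr1, and the interval end is treated as exclusive.

```python
from typing import List, NamedTuple


class IntervalRow(NamedTuple):
    """Hypothetical row of the interval table (keyed by interval only)."""
    start: int
    end: int
    gene_id: str
    rank: str


class VariantRow(NamedTuple):
    """Hypothetical matrix table row key: (locus, gene_set)."""
    pos: int
    gene_set: str


intervals = [
    IntervalRow(100, 2000, "GeneA", "rank1"),
    IntervalRow(100, 2000, "GeneA", "rank2"),
    IntervalRow(100, 2000, "GeneB", "rank1"),
    IntervalRow(100, 2000, "GeneC", "rank1"),
]

# chr1:1000 appears twice because it is keyed once per gene.
variants = [
    VariantRow(1000, "GeneA"),
    VariantRow(1000, "GeneB"),
    VariantRow(1100, "GeneC"),
]


def annotate(variant: VariantRow, table: List[IntervalRow]) -> List[IntervalRow]:
    # Step 1: every interval that overlaps the locus (the all_matches part).
    overlapping = [iv for iv in table if iv.start <= variant.pos < iv.end]
    # Step 2: keep only the matches for this row's own gene (the filter part).
    return [iv for iv in overlapping if iv.gene_id == variant.gene_set]


for v in variants:
    print(v.pos, v.gene_set, [(m.gene_id, m.rank) for m in annotate(v, intervals)])
# 1000 GeneA [('GeneA', 'rank1'), ('GeneA', 'rank2')]
# 1000 GeneB [('GeneB', 'rank1')]
# 1100 GeneC [('GeneC', 'rank1')]
```

Because each (locus, gene_set) row keeps only its own gene's intervals, exploding afterwards never pairs chr1:1000 with GeneC or chr1:1100 with GeneA or GeneB.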

Thank you very much @tpoterba for the confirmation