Hi,
I have a table (ht); see ht.describe():
----------------------------------------
Global fields:
None
----------------------------------------
Row fields:
'chr': str
'start': int32
'end': int32
'range': str
'window_id': str
'gene_name': str
'n_variants': int32
'gene_id': str
'interval': interval<locus<GRCh38>>
----------------------------------------
Key: ['interval']
I want to annotate my variants (in my mt) and create a struct row field that holds all of this information. One variant could overlap with multiple intervals, so I probably need to create an array<struct{}>.
I am not sure how to do that. Could anybody help me, please?
The familiar syntax below is almost what you want:
mt = mt.annotate_rows(interval_data = ht[mt.row_key])
However, this will give you one distinct struct per row of the matrix table. Instead, you can call through to the Table.index method and use the all_matches flag:
mt = mt.annotate_rows(interval_data = ht.index(mt.row_key, all_matches=True))
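To make the difference between the two lookups concrete, here is a plain-Python model of what each one returns. It does not call Hail: Window, lookup_one, and lookup_all are hypothetical names, and the window IDs and coordinates are made up for illustration. Which row Hail picks in the single-match case when several intervals overlap is not something this sketch tries to reproduce.

```python
from typing import NamedTuple, Optional


class Window(NamedTuple):
    contig: str
    start: int      # inclusive
    end: int        # exclusive
    window_id: str


def contains(w: Window, contig: str, pos: int) -> bool:
    """True if the position falls inside the half-open window [start, end)."""
    return w.contig == contig and w.start <= pos < w.end


def lookup_one(windows, contig, pos) -> Optional[Window]:
    """Model of ht[key]: at most one struct per variant."""
    for w in windows:
        if contains(w, contig, pos):
            return w
    return None


def lookup_all(windows, contig, pos) -> list:
    """Model of ht.index(key, all_matches=True): every overlapping row."""
    return [w for w in windows if contains(w, contig, pos)]


# Two made-up, overlapping windows.
windows = [
    Window("chr1", 100, 500, "w1"),
    Window("chr1", 300, 800, "w2"),
]

one = lookup_one(windows, "chr1", 400)
many = lookup_all(windows, "chr1", 400)
print(one.window_id)                 # w1
print([w.window_id for w in many])   # ['w1', 'w2']
```

With all matches collected, the annotation is an array of structs, which is the array<struct{}> shape asked for above.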
Thank you very much @tpoterba. It was exactly what I want.
I am having another issue now.
My ht looks like this:
I have a variant chr1:948594 in my mt which, surprisingly, was not annotated with the ht information, although the position is present in the ht (see the last line: the end position is the actual variant position).
Here is my code -
mt = mt.annotate_rows(windows = ht.index(mt.row_key[0], all_matches=True))
mt = mt.rows()
mt.select(mt.windows).show()
Which gives me the output like -
As you can see the last variant was not correctly annotated.
The printed representation of these intervals uses interval notation to indicate that they include the left endpoint but exclude the right endpoint. You’ll need to re-create these intervals to include the right endpoint if you want chr1:946174-chr1:948594 to include chr1:948594.
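As a rough illustration of why the variant was skipped, the following plain-Python function models endpoint inclusion. It is a sketch, not Hail code: in_interval is a hypothetical helper whose flags are named after the includes_start and includes_end parameters of hl.interval.

```python
def in_interval(pos, start, end, includes_start=True, includes_end=False):
    """Return True if pos lies in the interval, honouring endpoint flags."""
    lower_ok = pos >= start if includes_start else pos > start
    upper_ok = pos <= end if includes_end else pos < end
    return lower_ok and upper_ok


# Default: left-closed, right-open, so the end position itself is excluded.
half_open = in_interval(948594, 946174, 948594)
# Closing the right endpoint makes the end position a match.
closed = in_interval(948594, 946174, 948594, includes_end=True)

print(half_open)  # False
print(closed)     # True
```

In Hail itself, one way to rebuild the key might be ht = ht.key_by(interval=hl.interval(ht.interval.start, ht.interval.end, includes_end=True)). That line is an untested assumption about this particular table, so check it against your own data before relying on it.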
Hi,
I have a question regarding Table.index.
I have a Hail table (ht):
ht.describe()
----------------------------------------
Global fields:
None
----------------------------------------
Row fields:
'CHR': int32
'START': int32
'END': int32
'Window_boundary': str
'Window_Name': str
'Gene_ID': str
'No_of_Variants': int32
'Range': str
'interval': interval<locus<GRCh38>>
----------------------------------------
Key: ['interval', 'Gene_ID']
----------------------------------------
Now I need to annotate my mt. The key of my mt is:
----------------------------------------
Column key: ['s']
Row key: ['locus', 'gene_set']
----------------------------------------
As @tpoterba mentioned earlier in this thread, I am trying this:
mt = mt.annotate_rows(interval_data = ht.index(mt.row_key, all_matches=True))
But it gives me the error below:
ExpressionException: Key type mismatch: cannot index table with given expressions:
Table key: interval<locus<GRCh38>>, str
Index Expressions: locus<GRCh38>, str
However,
mt = mt.annotate_rows(interval_data = ht.index(mt.row_key[0], all_matches=True))
works. But I need to match both conditions: (i) the variant (mt.locus) should fall within the interval (ht.interval), and (ii) the gene should match, i.e. mt.gene_set should equal ht.Gene_ID.
Note that my mt already has some duplicated mt.locus values with different mt.gene_set values, but the key is unique.
Could you please help me? Thanks.
I think the solution here is to take Gene_ID out of the table key.
But then how do I match both conditions, locus and gene_set?
Also, mt.locus is repeated: the same locus appears for multiple genes!
I see. Interval joins are a bit special: they are only possible when joining interval<T> to T, and they don’t work when you have additional key fields after the interval.
I think what I’d do is (1) key the interval table by only the interval, and (2) filter the result afterwards:
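# Re-key by the interval only, so Gene_ID becomes an ordinary row field.
interval_ht = ht.key_by('interval')
mt = mt.annotate_rows(
    interval_matches=interval_ht.index(mt.locus, all_matches=True).filter(
        lambda x: x.Gene_ID == mt.gene_set))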
I am not sure this will give me what I want.
Let’s take an example: chr1:1000 is a LoF variant for GeneA and GeneB, and chr1:1100 is a LoF variant for GeneC.
The interval file says -
chr1:100-2000 GeneA rank1
chr1:100-2000 GeneA rank2
chr1:100-2000 GeneB rank1
chr1:100-2000 GeneC rank1
I want to annotate my MatrixTable in a way -
chr1:1000 [GeneA, GeneB] ## this we get from all_matches=True
chr1:1100 [GeneC]
But what you proposed, wouldn’t I get
chr1:1000 [GeneA, GeneB, GeneC]
chr1:1100 [GeneA, GeneB, GeneC]
Eventually, I want to explode the genes, and then chr1:1000 would be annotated with GeneC, and chr1:1100 would be annotated with GeneA and GeneB, which is wrong.
Sorry for making this so complicated.
Ahh, wait.
So what you proposed, @tpoterba, actually filters by gene, right?
So with the example above, it gives:
chr1:1000 [GeneA, GeneB] ## this we get from all_matches=True
chr1:1100 [GeneC]
Right ??
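For anyone following along, here is a plain-Python model of the join-then-filter idea applied to the example above. It is only a sketch and does not call Hail: IntervalRow, VariantRow, and annotate are hypothetical names, the intervals are treated as half-open, and the matrix table rows are assumed to be one row per (locus, gene_set) pair, as the row key suggests.

```python
from typing import NamedTuple


class IntervalRow(NamedTuple):
    contig: str
    start: int    # inclusive
    end: int      # exclusive
    gene_id: str
    rank: str


class VariantRow(NamedTuple):
    contig: str
    pos: int
    gene_set: str


# The interval file from the example.
intervals = [
    IntervalRow("chr1", 100, 2000, "GeneA", "rank1"),
    IntervalRow("chr1", 100, 2000, "GeneA", "rank2"),
    IntervalRow("chr1", 100, 2000, "GeneB", "rank1"),
    IntervalRow("chr1", 100, 2000, "GeneC", "rank1"),
]

# One row per (locus, gene_set) key, so chr1:1000 appears twice.
variants = [
    VariantRow("chr1", 1000, "GeneA"),
    VariantRow("chr1", 1000, "GeneB"),
    VariantRow("chr1", 1100, "GeneC"),
]


def annotate(variant, intervals):
    # Step 1: collect every interval containing the locus (all_matches=True).
    overlapping = [
        iv for iv in intervals
        if iv.contig == variant.contig and iv.start <= variant.pos < iv.end
    ]
    # Step 2: keep only matches whose gene equals this row's gene_set.
    return [iv for iv in overlapping if iv.gene_id == variant.gene_set]


result = {
    (v.pos, v.gene_set): [(iv.gene_id, iv.rank) for iv in annotate(v, intervals)]
    for v in variants
}
for key, matches in result.items():
    print(key, matches)
# (1000, 'GeneA') [('GeneA', 'rank1'), ('GeneA', 'rank2')]
# (1000, 'GeneB') [('GeneB', 'rank1')]
# (1100, 'GeneC') [('GeneC', 'rank1')]
```

Because the filter runs per row, each (locus, gene_set) row keeps only its own gene's intervals. Both GeneA ranks survive for the GeneA row, so a later explode would not attach GeneC to chr1:1000.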
Thank you very much @tpoterba for the confirmation