Filter_rows using row_key of a different source


#1

I want to filter variants using a criterion that is computed from only a subset of the samples.
I first want to create a new mt by filtering cols and then rows, so that the row filter is based on only part of the samples. Then I want to use the row key (locus) of this new mt to select variants in the initial mt.

mt1 is the initial MatrixTable; I create a new mt2 by filtering the samples down to controls.

> mt1
> mt2 = mt1.filter_cols(mt1.condition == "control")

For mt2, filter variants on the Hardy-Weinberg p-value.

mt2 = mt2.filter_rows(mt2.variant_qc.p_value_hwe < 10**-5)

Ensure that mt2 has row key locus just like mt1.

mt2 = mt2.key_rows_by("locus")

I want to use the row key of mt2 to select variants in mt1, that is, keep the variants that pass HWE based on only the control samples, not the cases.

idx = mt2.row.select("locus")
mt1 = mt1.filter_rows(mt1.locus == idx)

TypeError: Invalid '==' comparison, cannot compare expressions of type 'locus<GRCh37>' and 'struct{locus: locus<GRCh37>}'


#2

You’re looking for a combination of the “join” syntax and the rows() table:

mt2 = mt2.key_rows_by("locus")
mt1 = mt1.filter_rows(hl.is_defined(mt2.rows()[mt1.locus]))

mt2.rows() says, give me the rows table (the variants table, without genotypes) of mt2.

mt2.rows()[mt1.locus] is a “join”, which finds the row in mt2.rows() whose row key is equal to mt1.locus. If no such matching row exists, mt2.rows()[mt1.locus] is a missing value.

hl.is_defined(x) checks if something is not a missing value.
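
If it helps to picture the semantics, here is a rough plain-Python model of that keyed lookup. This is not Hail code: a dict stands in for mt2.rows(), None stands in for a missing value, and the names lookup, is_defined, controls_rows, and all_loci (as well as the example loci and p-values) are made up for illustration.

```python
# Plain-Python model of the keyed join; this is NOT Hail code.
# Loci are modeled as (contig, position) tuples.

# Stand-in for mt2.rows(): row key -> row fields of the variants kept in mt2.
controls_rows = {
    ("1", 1000): {"p_value_hwe": 0.50},
    ("1", 3000): {"p_value_hwe": 0.02},
}

# Stand-in for the row keys of mt1.
all_loci = [("1", 1000), ("1", 2000), ("1", 3000)]


def lookup(table, key):
    """Model of table[key]: the matching row, or None (missing) if no row has that key."""
    return table.get(key)


def is_defined(value):
    """Model of hl.is_defined: True when the value is not missing."""
    return value is not None


# Model of mt1.filter_rows(hl.is_defined(mt2.rows()[mt1.locus])).
kept = [locus for locus in all_loci if is_defined(lookup(controls_rows, locus))]
print(kept)  # [('1', 1000), ('1', 3000)]
```

The locus ("1", 2000) has no matching row in the stand-in table, so its lookup is missing and the filter drops it.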


#3

There’s a new, nicer syntax for this. Instead of:

mt1 = mt1.filter_rows(hl.is_defined(mt2.rows()[mt1.locus]))

you can use a “semi join”:

mt1 = mt1.semi_join_rows(mt2.rows())

#6

@danking Thanks, that works!

@tpoterba

mt1 = mt1.semi_join_rows(mt2.rows())

Gives the following error:
AttributeError: MatrixTable instance has no field, method, or property 'semi_join_rows'. Did you mean: MatrixTable method: 'union_rows'

When I use ‘union_rows’ instead, I get:
hail.matrixtable.MatrixTable, found hail.table.Table: <hail.table.Table object at 0x7fbec5d6a4a8>'
However, both mt1 and mt2 are typed as MatrixTable.

What is going wrong here?


#7

semi_join_rows was added just a couple of days ago, so you’ll need version 0.2.11 or later.
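
If you want to check this from a script, something like the sketch below might do. It assumes the version string looks like '0.2.11-<hash>', which I believe is the shape of what hl.version() returns; the helper name supports_semi_join_rows and the 'abc123' suffix are made up for the example.

```python
import re


def supports_semi_join_rows(version_string, minimum=(0, 2, 11)):
    """Return True if a version string such as '0.2.11-abc123' is at least `minimum`."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    # Compare as integer tuples so that 0.2.9 sorts before 0.2.11.
    return tuple(int(part) for part in match.groups()) >= minimum


# In a real session the string would come from hl.version();
# hypothetical literals are used here so the sketch runs without Hail installed.
print(supports_semi_join_rows("0.2.10-abc123"))  # False
print(supports_semi_join_rows("0.2.11-abc123"))  # True
```

The versions are compared as tuples of integers rather than as strings, because a plain string comparison would rank "0.2.9" above "0.2.11".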


#8

@tpoterba Great, thanks!