I want to filter variants using a criterion that is computed from only a subset of the samples.
I first want to create a new mt by filtering columns and then rows, so that the row filter is based on only part of the samples. Then I want to use the row key (locus) of this new mt to select variants from the initial mt.
mt1 is the initial MatrixTable; I create a new one, mt2, by filtering the samples down to controls.
mt2 = mt1.filter_cols(mt1.condition == "control")
For mt2, filter variants on the Hardy-Weinberg p-value.
mt2 = mt2.filter_rows(mt2.variant_qc.p_value_hwe < 10**-5)
Ensure that mt2 has the row key locus, just like mt1.
mt2 = mt2.key_rows_by("locus")
I want to use the key of mt2 to select variants of mt1, that is, keep the variants that pass HWE based on only the control samples and not the cases.
idx = mt2.row.select("locus")
mt1 = mt1.filter_rows(mt1.locus == idx)
TypeError: Invalid '==' comparison, cannot compare expressions of type 'locus<GRCh37>' and 'struct{locus: locus<GRCh37>}'
You’re looking for a combination of the “join” syntax and the rows() table:
mt2 = mt2.key_rows_by("locus")
mt1 = mt1.filter_rows(hl.is_defined(mt2.rows()[mt1.locus]))
mt2.rows() says: give me the rows table (the variants table, without genotypes) of mt2.
mt2.rows()[mt1.locus] is a “join”, which finds the row in mt2.rows() whose row key is equal to mt1.locus. If no such matching row exists, mt2.rows()[mt1.locus] is a missing value.
hl.is_defined(x) checks whether x is not a missing value.
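As a rough mental model of the join and the is_defined check described above, here is a plain-Python sketch. It does not use Hail at all: the dictionary, the locus strings, and the helper names join and is_defined are made up for illustration, and Python's None stands in for Hail's missing value.

```python
# Plain-Python analogy only; this is not Hail code.
# Hypothetical rows table of mt2: row key (locus) -> row fields.
mt2_rows = {
    "1:1000": {"p_value_hwe": 0.5},
    "1:3000": {"p_value_hwe": 0.2},
}

# Hypothetical loci present in mt1.
mt1_loci = ["1:1000", "1:2000", "1:3000", "1:4000"]


def join(rows, key):
    """Look up the row whose key equals `key`; None plays the role of a missing value."""
    return rows.get(key)


def is_defined(x):
    """True when x is not the missing value."""
    return x is not None


# Analogue of mt1.filter_rows(hl.is_defined(mt2.rows()[mt1.locus])):
# keep only the loci of mt1 that also have a row in mt2.
kept = [locus for locus in mt1_loci if is_defined(join(mt2_rows, locus))]
print(kept)  # ['1:1000', '1:3000']
```

The loci "1:2000" and "1:4000" have no matching row in the toy mt2 table, so the lookup yields the missing value and they are dropped.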
There’s a newer, nicer syntax for this filter:
mt1 = mt1.filter_rows(hl.is_defined(mt2.rows()[mt1.locus]))
It’s called a “semi join”:
mt1 = mt1.semi_join_rows(mt2.rows())
@danking Thanks, that works!
@tpoterba
mt1 = mt1.semi_join_rows(mt2.rows())
Gives the following error:
AttributeError: MatrixTable instance has no field, method, or property 'semi_join_rows'. Did you mean: MatrixTable method: 'union_rows'
When I use union_rows instead, I get:
hail.matrixtable.MatrixTable, found hail.table.Table: <hail.table.Table object at 0x7fbec5d6a4a8>'
However, both mt1 and mt2 are typed as MatrixTable.
What is going wrong here?
semi_join_rows was added just a couple of days ago; you’ll need version 0.2.11 or later.
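If you want to check the version requirement programmatically, something like the sketch below might help. The helper names parse_version and has_semi_join_rows are hypothetical, and the sketch assumes the version string looks like "0.2.11" with an optional "-<build>" suffix, which is what I would expect hl.version() to return. It compares numeric tuples because a plain string comparison would rank "0.2.9" above "0.2.11".

```python
def parse_version(version):
    """Turn '0.2.11-abcdef' into (0, 2, 11), ignoring any build suffix after '-'."""
    core = version.split("-")[0]
    return tuple(int(part) for part in core.split("."))


def has_semi_join_rows(version, minimum=(0, 2, 11)):
    """True when `version` is at least the release named in this thread."""
    return parse_version(version) >= minimum


# With Hail installed, the argument would come from hl.version() (assumed format).
print(has_semi_join_rows("0.2.10"))         # False
print(has_semi_join_rows("0.2.11-abcdef"))  # True
```

On an older version, the filter_rows / hl.is_defined form from earlier in the thread does the same job.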