Filter_rows by list of genes

When I filter like this:

filter_condition = ((mt.info.SYMBOL == ‘GENE1’)|(mt.info.SYMBOL == ‘GENE2’)|(mt.info.SYMBOL == ‘GENE3’))
mt.filter_rows(filter_condition).count()

it works well. How can I filter not by specifying each gene individually, but use a list of genes, e.g. by importing it as a table.

Great question, try this:

mt = mt.filter_rows(hl.literal(['GENE1', 'GENE2',...]).contains(mt.info.SYMBOL))

If you have a table you could do this:

genes = hl.import_table(...).gene_name.collect()
mt = mt.filter_rows(hl.literal(genes).contains(mt.info.SYMBOL))

If your table was really big (e.g. it contained loci), there are better ways to do this, but for small sets (like thousands of genes), this is usually better.

2 Likes

Thanks - that works ! I was trying with hl.eval_typed instead of hl.literal.

Hey @danking

I’ve been trying to do something similar to johnnyr’s post and I keep getting an error:

mt_rsid.filter_rows(snpIds.contains(mt_rsid.dbSNP_rsid[‘rsid’])).show(n_cols=5)

I get this error:

Error summary: HailException: no conversion found for contains(, array, array) => bool

I am utilizing the dbSNP annotation from the datasets to search rsID values from a genome extraction with the hopes of eventually creating a mt of only select rsID values for analysis.

This code does work for me similar to johnnyr

mt_rsid.filter_rows((mt_rsid.dbSNP_rsid[‘rsid’] == [(“RSVALUE#1”)])|(mt_rsid.dbSNP_rsid[‘rsid’] == [(“RSVALUE#2”)])).show(n_cols=5)

Hmm. Definitely a bad error message. We will fix that.

The issue is that you’re asking if snpIds (a list/array of strings) contains a list of rsids. Maybe you’re looking for

len(hl.set(snpIds).intersection(hl.set(mt_rsid....))) != 0

That asks if the intersection is non-zero, i.e. the sets have shared elements.

I’ll also say that this operation is probably very slow because Hail doesn’t index by rsid, it indexes by locus and alleles, by default.

Can you share your whole Hail script? You probably want to convert your snpIds list to a locus and alleles list and then use that to filter_rows.

Sure! Here is what I got. I utilized the annotation database to get the rsid values.

db = hl.experimental.DB(region='us', cloud='gcp')
mt = db.annotate_rows_db(mt, 'dbSNP_rsid')

keys = ['locus','alleles','rsid','dbSNP_rsid']
mt_rsid = mt.key_rows_by(*keys)

def create_dbSNP_rsid_filter(rsid_list) :
    rsfilter = [   ]
    for ID in rsid_list:
        rsfilter.append("""(mt_rsid.dbSNP_rsid['rsid'] == [("{}")])""".format(ID))
    rsfilter = '|'.join(rsfilter)
    
    return '(' + rsfilter + ')'

pc_mt_rsids = mt_rsid.filter_rows(eval(create_dbSNP_rsid_filter(snpIds)))

This was my workaround. Pretty much just expanded on what johnnyr did.

Yeah, unfortunately the dbSNP_rsid table is keyed by locus and alleles as is (I assume) your MT. That means it is fast to filter to locus and alleles and expensive to filter by anything else. We have some technology in the works to secondarily index tables so you can quickly filter on non-key columns. Even if that existed, we’d need to automatically detect that you’re filtering against a small enough number of RSIDs that we should use something called a “broadcast join” to do this much faster.

In general, Hail will be dramatically faster if you use locus & alleles to filter variants. We should really create a function to convert back and forth between these quickly. I’ll add it to the (ever growing) list.


Aside, you don’t need to use key_by or eval:

db = hl.experimental.DB(region='us', cloud='gcp')
mt = db.annotate_rows_db(mt, 'dbSNP_rsid')
mt = mt.filter_rows(hl.any(*[
    mt.dbSNP_rsid.rsid == snpid
    for snpid in snpIds
]))
1 Like