Filter_rows by list of genes

johnnyr · April 8, 2021, 6:28pm

When I filter like this:

filter_condition = ((mt.info.SYMBOL == ‘GENE1’)|(mt.info.SYMBOL == ‘GENE2’)|(mt.info.SYMBOL == ‘GENE3’))
mt.filter_rows(filter_condition).count()

it works well. How can I filter not by specifying each gene individually, but use a list of genes, e.g. by importing it as a table.

danking · April 8, 2021, 6:31pm

Great question, try this:

mt = mt.filter_rows(hl.literal(['GENE1', 'GENE2',...]).contains(mt.info.SYMBOL))

If you have a table you could do this:

genes = hl.import_table(...).gene_name.collect()
mt = mt.filter_rows(hl.literal(genes).contains(mt.info.SYMBOL))

If your table was really big (e.g. it contained loci), there are better ways to do this, but for small sets (like thousands of genes), this is usually better.

johnnyr · April 8, 2021, 7:41pm

Thanks - that works ! I was trying with hl.eval_typed instead of hl.literal.

conwei123 · February 16, 2023, 5:08pm

Hey @danking

I’ve been trying to do something similar to johnnyr’s post and I keep getting an error:

mt_rsid.filter_rows(snpIds.contains(mt_rsid.dbSNP_rsid[‘rsid’])).show(n_cols=5)

I get this error:

Error summary: HailException: no conversion found for contains(, array, array) => bool

I am utilizing the dbSNP annotation from the datasets to search rsID values from a genome extraction with the hopes of eventually creating a mt of only select rsID values for analysis.

This code does work for me similar to johnnyr

mt_rsid.filter_rows((mt_rsid.dbSNP_rsid[‘rsid’] == [(“RSVALUE#1”)])|(mt_rsid.dbSNP_rsid[‘rsid’] == [(“RSVALUE#2”)])).show(n_cols=5)

danking · February 16, 2023, 7:02pm

Hmm. Definitely a bad error message. We will fix that.

The issue is that you’re asking if snpIds (a list/array of strings) contains a list of rsids. Maybe you’re looking for

len(hl.set(snpIds).intersection(hl.set(mt_rsid....))) != 0

That asks if the intersection is non-zero, i.e. the sets have shared elements.

I’ll also say that this operation is probably very slow because Hail doesn’t index by rsid, it indexes by locus and alleles, by default.

Can you share your whole Hail script? You probably want to convert your snpIds list to a locus and alleles list and then use that to filter_rows.

conwei123 · February 17, 2023, 3:37pm

Sure! Here is what I got. I utilized the annotation database to get the rsid values.

db = hl.experimental.DB(region='us', cloud='gcp')
mt = db.annotate_rows_db(mt, 'dbSNP_rsid')

keys = ['locus','alleles','rsid','dbSNP_rsid']
mt_rsid = mt.key_rows_by(*keys)

def create_dbSNP_rsid_filter(rsid_list) :
    rsfilter = [   ]
    for ID in rsid_list:
        rsfilter.append("""(mt_rsid.dbSNP_rsid['rsid'] == [("{}")])""".format(ID))
    rsfilter = '|'.join(rsfilter)
    
    return '(' + rsfilter + ')'

pc_mt_rsids = mt_rsid.filter_rows(eval(create_dbSNP_rsid_filter(snpIds)))

This was my workaround. Pretty much just expanded on what johnnyr did.

danking · February 17, 2023, 3:56pm

Yeah, unfortunately the dbSNP_rsid table is keyed by locus and alleles as is (I assume) your MT. That means it is fast to filter to locus and alleles and expensive to filter by anything else. We have some technology in the works to secondarily index tables so you can quickly filter on non-key columns. Even if that existed, we’d need to automatically detect that you’re filtering against a small enough number of RSIDs that we should use something called a “broadcast join” to do this much faster.

In general, Hail will be dramatically faster if you use locus & alleles to filter variants. We should really create a function to convert back and forth between these quickly. I’ll add it to the (ever growing) list.

Aside, you don’t need to use key_by or eval:

db = hl.experimental.DB(region='us', cloud='gcp')
mt = db.annotate_rows_db(mt, 'dbSNP_rsid')
mt = mt.filter_rows(hl.any(*[
    mt.dbSNP_rsid.rsid == snpid
    for snpid in snpIds
]))

Topic		Replies	Views
How to filter to a specific locus Hail Query & hailctl	4	945	February 28, 2023
More efficient way to extract calls? Hail Query & hailctl	2	370	December 14, 2022
Filter cols from python list Hail Query & hailctl	1	469	February 23, 2022
Subtring match in array<str> info column Hail Query & hailctl	3	411	November 30, 2021
Filter for specific variants Hail Query & hailctl	1	625	July 23, 2023

Filter_rows by list of genes

Related topics