If your table were really big (e.g. it contained loci), there would be better ways to do this, but for small sets (like thousands of genes), this is usually the better choice.
Error summary: HailException: no conversion found for contains(, array, array) => bool
I am using the dbSNP annotation from the datasets to look up rsID values from a genome extraction, in the hope of eventually creating an MT containing only selected rsID values for analysis.
Yeah, unfortunately the dbSNP_rsid table is keyed by locus and alleles as is (I assume) your MT. That means it is fast to filter to locus and alleles and expensive to filter by anything else. We have some technology in the works to secondarily index tables so you can quickly filter on non-key columns. Even if that existed, we’d need to automatically detect that you’re filtering against a small enough number of RSIDs that we should use something called a “broadcast join” to do this much faster.
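To make the idea concrete, here is a toy model in plain Python of what a broadcast-style filter does. This is only a conceptual sketch, not Hail's implementation: the function names (`keys_for_rsids`, `filter_by_key`), the dict-based "table", and the rsIDs and loci are all made up for illustration. The small rsID set is held in memory, the annotation table is scanned once to find the matching (locus, alleles) keys, and everything after that is a cheap key lookup.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical key type standing in for (locus, alleles).
Key = Tuple[str, Tuple[str, ...]]


def keys_for_rsids(annotation: Dict[Key, str], wanted_rsids: Iterable[str]) -> Set[Key]:
    """Scan the annotation table once, testing each rsID against a small in-memory set."""
    wanted = set(wanted_rsids)  # the small side of the join, held in memory ("broadcast")
    return {key for key, rsid in annotation.items() if rsid in wanted}


def filter_by_key(variants: Iterable[Key], keep: Set[Key]) -> List[Key]:
    """Keep only variants whose key is in the set; each check is a hash lookup."""
    return [v for v in variants if v in keep]


# Made-up data: three variants keyed by (locus, alleles), each with a fake rsID.
annotation = {
    ("1:1000", ("A", "G")): "rs0000001",
    ("1:2000", ("C", "T")): "rs0000002",
    ("2:3000", ("G", "GA")): "rs0000003",
}
variants = list(annotation)  # dicts preserve insertion order

keep = keys_for_rsids(annotation, ["rs0000001", "rs0000003"])
selected = filter_by_key(variants, keep)
print(selected)
```

The expensive part is the single scan over the non-key column; once you have the keys, filtering by locus and alleles is the fast path described above.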
In general, Hail will be dramatically faster if you use locus & alleles to filter variants. We should really create a function to convert back and forth between these quickly. I’ll add it to the (ever growing) list.
As an aside, you don’t need to use key_by or eval:
db = hl.experimental.DB(region='us', cloud='gcp')
mt = db.annotate_rows_db(mt, 'dbSNP_rsid')
mt = mt.filter_rows(hl.any(*[
    mt.dbSNP_rsid.rsid == snpid
    for snpid in snpIds
]))