Best way to "filter" by rsid?

Hi,

I can see now how the variants parameter of import_bgen() can be used to filter for variants rather rapidly, as opposed to using mt.filter_rows() on the matrix table returned from import_bgen() without using the variants parameter.

However, there are use cases where filtering by rsid is needed.

Is there a best strategy for doing so using BGEN or VCF as input that is similar in performance to using the variants parameter of import_bgen()?

Thanks,

Vince

Hey Vince,

There isn’t an equivalent for BGEN or VCF input, but you can do this with Hail Tables and MatrixTables. If you have a Table keyed by rsid, or a MatrixTable whose row key is rsid, you can use hl.filter_intervals to quickly filter down to a particular set of rsid ranges. When filter_intervals is applied to the field a table is keyed by, we should optimize it so that only the relevant files in the dataset are read, which makes the filter faster.
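
For illustration, here is a minimal sketch of that kind of lookup. It assumes a Table keyed by a single string field named rsid; the helper names normalize_rsids and filter_by_rsid are made up for this example, and each requested rsid is expressed as a closed point interval.

```python
def normalize_rsids(rsids):
    """Strip whitespace, drop empty entries and duplicates, and sort."""
    return sorted({r.strip() for r in rsids if r.strip()})


def filter_by_rsid(ht, rsids):
    """Keep rows of `ht` whose key is one of `rsids`.

    Assumption: `ht` is a Hail Table keyed by one string field (the rsid).
    """
    import hail as hl  # imported lazily so normalize_rsids works without Hail

    ids = normalize_rsids(rsids)
    if not ids:
        raise ValueError("need at least one rsid")

    # One closed interval [rsid, rsid] per requested rsid.
    intervals = [
        hl.Interval(r, r, includes_start=True, includes_end=True,
                    point_type=hl.tstr)
        for r in ids
    ]
    return hl.filter_intervals(ht, intervals)
```

Usage would look something like filter_by_rsid(hl.read_table('annotations.ht'), ['rs123', 'rs456']), where the path and rsids are placeholders.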

Cool!

So I:

  1. Build a Hail table that contains rsid, chrom, pos, a1 and a2.
  2. Query table by rsid and convert results to “chrom:pos:a1:a2”.
  3. Feed results from 2 into import_bgen.

Is this the correct approach?

Vince

I think that should work!
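
A rough sketch of those three steps, in case it helps. It assumes the annotation table has fields rsid (str), chrom (str), pos (int32), a1 and a2 (str), that a1 is the first allele as stored in the BGEN, and that the BGEN has already been indexed with hl.index_bgen. The function names, the field names and the 'GT' entry field are illustrative choices, not anything prescribed by Hail.

```python
def variant_string(chrom, pos, a1, a2):
    """Format one variant as 'chrom:pos:a1:a2' (step 2)."""
    return f"{chrom}:{pos}:{a1}:{a2}"


def import_bgen_by_rsid(bgen_path, sample_path, ann, rsids):
    """Import only the BGEN variants whose rsid is in `rsids`.

    Assumption: `ann` is a Hail Table with fields rsid, chrom, pos, a1, a2.
    """
    import hail as hl  # imported lazily so variant_string works without Hail

    # Step 2a: look up the requested rsids. This is a plain filter;
    # on a table keyed by rsid, hl.filter_intervals may be faster.
    wanted = hl.literal(set(rsids), hl.tset(hl.tstr))
    hits = ann.filter(wanted.contains(ann.rsid)).collect()

    # Step 2b: convert the hits to 'chrom:pos:a1:a2' strings.
    strings = [variant_string(r.chrom, r.pos, r.a1, r.a2) for r in hits]

    # Parse the strings into structs with `locus` and `alleles` fields,
    # which is one of the forms the `variants` parameter accepts.
    parsed = hl.literal(strings, hl.tarray(hl.tstr)).map(
        lambda v: hl.parse_variant(v))
    variants = hl.eval(parsed)

    # Step 3: feed the variants into import_bgen.
    return hl.import_bgen(bgen_path,
                          entry_fields=['GT'],
                          sample_file=sample_path,
                          variants=variants)
```

Collecting the hits to Python is fine for a modest number of rsids; for very large sets, keying the filtered table by locus and alleles and passing the Table itself as variants would avoid the round trip.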