Best way to "filter" by rsid?


I can see now that the variants parameter of import_bgen() filters for variants much faster than calling mt.filter_rows() on the matrix table that import_bgen() returns when the variants parameter is not used.

However, there are use cases where filtering by rsid is needed.

Is there a best strategy for doing so using BGEN or VCF as input that is similar in performance to using the variants parameter of import_bgen()?



Hey Vince,

There isn’t something like this for BGEN or VCF, but you can do this with Hail tables and matrix tables. If you have a Table keyed by rsid or a MatrixTable whose row key is rsid, you can use hl.filter_intervals to quickly filter down to a particular set of ranges of rsids. When filter_intervals is applied to the field a table is keyed by, Hail should optimize the filter so that it only reads the relevant files in the dataset.


So I:

  1. Build a Hail table, keyed by rsid, that contains rsid, chrom, pos, a1 and a2.
  2. Query the table by rsid and convert the results to “chrom:pos:a1:a2”.
  3. Feed the results from step 2 into import_bgen via the variants parameter.

Is this the correct approach?


I think that should work!
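
For anyone landing here later, the three steps above could be sketched roughly as follows. This is only an illustration under some assumptions: the function names, the paths passed in, and the lookup table layout (keyed by rsid, with chrom as str, pos as int32, and a1/a2 as str in the same allele order as the BGEN) are all hypothetical, and the BGEN is assumed to have been indexed with hl.index_bgen already.

```python
def variant_string(chrom, pos, a1, a2):
    """Format one lookup row as 'chrom:pos:a1:a2' (step 2)."""
    return f"{chrom}:{pos}:{a1}:{a2}"


def import_bgen_by_rsid(bgen_path, sample_path, lookup_path, rsids):
    """Steps 2 and 3: look up rsids, then import only those variants."""
    import hail as hl  # imported here so variant_string works without Hail

    # Step 1 is assumed done: a table keyed by `rsid` with fields
    # chrom, pos, a1, a2 was written to lookup_path (hypothetical layout).
    lookup = hl.read_table(lookup_path)

    # One closed, single-point interval per rsid; filter_intervals works
    # here because the intervals are over the table's key field.
    intervals = [
        hl.interval(r, r, includes_start=True, includes_end=True)
        for r in rsids
    ]
    hits = hl.filter_intervals(lookup, intervals).collect()

    # Step 2: convert the matching rows to "chrom:pos:a1:a2" strings.
    strings = [variant_string(h.chrom, h.pos, h.a1, h.a2) for h in hits]

    # Step 3: parse the strings into a table keyed by locus and alleles,
    # which import_bgen accepts for its `variants` parameter.
    # Assumes contig names match the default reference genome.
    variants = hl.Table.parallelize(
        [{'v': s} for s in strings], hl.tstruct(v=hl.tstr)
    )
    variants = variants.key_by(**hl.parse_variant(variants.v)).select()

    return hl.import_bgen(
        bgen_path,
        entry_fields=['GT'],
        sample_file=sample_path,
        variants=variants,
    )
```

If the lookup table already stores locus and alleles fields, the string round trip can probably be skipped by re-keying the filtered table on those fields and passing it as variants directly.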