Hi,
I can see now how the variants parameter of import_bgen() can be used to filter for variants rather rapidly, as opposed to using mt.filter_rows() on the matrix table returned from import_bgen() without the variants parameter.
However, there are use cases where filtering by rsid is needed.
Is there a best strategy for doing so, using BGEN or VCF as input, that is similar in performance to using the variants parameter of import_bgen()?
Thanks,
Vince
Hey Vince,
There isn’t something like this for BGEN or VCF, but you can do this with Hail Tables and MatrixTables. If you have a Table keyed by rsid, or a MatrixTable whose row key is rsid, you can use hl.filter_intervals to quickly downsample to a particular set of rsid ranges. When we see a filter_intervals on the field a table is keyed by, we should apply an optimization that makes the filter faster by only reading certain files in the dataset.
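A minimal sketch of that idea, assuming a Table `ht` that is already keyed by a string field named `rsid` (the table, the field name, and both helper names are hypothetical). Each rsid becomes a closed single-point interval, so `hl.filter_intervals` can use the key ordering instead of scanning every row. Hail is imported inside the function only so the pure-Python helper can be used without it.

```python
def rsid_point_intervals(rsids):
    """Return sorted, de-duplicated (start, end) pairs, one per rsid."""
    return [(r, r) for r in sorted(set(rsids))]


def filter_table_by_rsids(ht, rsids):
    """Keep only rows of `ht` whose key is one of `rsids`.

    Assumes `ht` is keyed by a single string field holding the rsid.
    """
    import hail as hl  # imported lazily; requires a working Hail install

    # includes_end=True makes each interval contain exactly one key value.
    intervals = [
        hl.interval(start, end, includes_end=True)
        for start, end in rsid_point_intervals(rsids)
    ]
    return hl.filter_intervals(ht, intervals)
```

Whether the interval filter actually skips partitions depends on the table being written to disk keyed by rsid, so treat the speedup as something to measure rather than assume.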
Cool!
So I:
1. Build a Hail table that contains rsid, chrom, pos, a1 and a2.
2. Query the table by rsid and convert the results to “chrom:pos:a1:a2”.
3. Feed the results from step 2 into import_bgen.
Is this the correct approach?
Vince
I think that should work!
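For later readers, the three steps above could be sketched roughly as follows. This is an illustration, not a tested recipe: `lookup_ht` is a hypothetical Table with fields `rsid`, `chrom`, `pos`, `a1` and `a2`, the function names are made up, and it assumes the BGEN file has already been indexed with `hl.index_bgen` and that contig names match the default reference genome.

```python
def to_variant_string(chrom, pos, a1, a2):
    """Format one lookup row as the 'chrom:pos:a1:a2' string used in step 2."""
    return f"{chrom}:{pos}:{a1}:{a2}"


def import_bgen_for_rsids(bgen_path, sample_path, lookup_ht, rsids):
    """Import only the BGEN variants whose rsid appears in `rsids`."""
    import hail as hl  # imported lazily; requires a working Hail install

    # Step 2: query the lookup table by rsid and pull the matches locally.
    wanted = hl.literal(set(rsids))
    rows = lookup_ht.filter(wanted.contains(lookup_ht.rsid)).collect()
    strings = [to_variant_string(r.chrom, r.pos, r.a1, r.a2) for r in rows]
    if not strings:
        raise ValueError("none of the requested rsids were found")

    # Parse all strings in one evaluation into locus/alleles structs.
    variants = hl.eval(hl.literal(strings).map(lambda s: hl.parse_variant(s)))

    # Step 3: feed the parsed variants into import_bgen.
    return hl.import_bgen(
        bgen_path,
        entry_fields=["GT"],
        sample_file=sample_path,
        variants=variants,
    )
```

Collecting the matches locally is only reasonable for a modest number of rsids; for a very large list, passing a keyed Table to the variants parameter may be the better fit.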