I would like to import a small sample of variants (~200) from large VCF files. For BGEN files, the import_bgen function has a nice parameter “variants” that allows for this. The function import_vcf, however, does not seem to have anything comparable.
Is there a way to import a list of variants from a (compressed) VCF?
The variants option on import_bgen exists only because there were specific performance optimizations we wanted to make for BGEN. With import_vcf, you can filter after import, using either a table or a literal:
import hail as hl
mt = hl.import_vcf(...)
# Option 1: semi_join_rows with a table keyed like the matrix table rows. Short, but it will end up reading the whole VCF.
mt = mt.semi_join_rows(variants_to_include)
# option 2, filtering with a set, which generates a very efficient query plan
variants_to_include_set = hl.literal(set(variants_to_include.key.collect()))
mt = mt.filter_rows(variants_to_include_set.contains(mt.row_key))
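Both options assume that variants_to_include is a Hail table keyed by locus and alleles, which the snippet above does not construct. A minimal sketch of one way to build it from a short list of contig:pos:ref:alt strings might look like this. The helper names to_variant_string and build_variants_table, the example variants, and the GRCh37 reference genome are assumptions for illustration, not part of the original answer.

```python
def to_variant_string(contig, pos, ref, alt):
    """Format one variant as 'contig:pos:ref:alt', the form hl.parse_variant reads."""
    return f"{contig}:{pos}:{ref}:{alt}"


def build_variants_table(variant_strings, reference_genome="GRCh37"):
    """Return a Hail Table keyed by (locus, alleles), one row per variant string."""
    import hail as hl  # imported here so the pure-Python helper works without Hail

    ht = hl.Table.parallelize(
        [{"v": s} for s in variant_strings],
        hl.tstruct(v=hl.tstr),
    )
    # parse_variant yields a struct with `locus` and `alleles` fields;
    # unpacking it makes those two fields the table key.
    return ht.key_by(**hl.parse_variant(ht.v, reference_genome=reference_genome))


# Hypothetical example variants; replace with your own list of ~200.
wanted = [
    to_variant_string("1", 12345, "A", "T"),
    to_variant_string("2", 67890, "G", "C"),
]
# variants_to_include = build_variants_table(wanted)
```

The resulting table can be passed to either option as variants_to_include. Its reference genome has to match the one used by import_vcf, otherwise the keys will not have the same type as the matrix table's row key.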
Thanks, Tim. It sounds like option 2 will be the fastest. I’ll try that.