Import only a few variants using import_vcf

I would like to import a small sample of variants (~200) from large VCF files. For BGEN files, the import_bgen function has a nice parameter “variants” that allows for this. The function import_vcf, however, does not seem to have anything comparable.

Is there a way to import a list of variants from a (compressed) VCF?

The variants option on import_bgen exists only because there were specific performance optimizations we wanted to make for that format. With import_vcf, you can filter after importing, using either a table or a literal:

import hail as hl

mt = hl.import_vcf(...)

# variants_to_include is assumed to be a Hail Table keyed like the rows of mt (locus, alleles)

# option 1: semi_join_rows. Short, but will end up reading the whole VCF
mt = mt.semi_join_rows(variants_to_include)

# option 2: filter with a literal set of row keys, which generates a very efficient query plan
variants_to_include_set = hl.literal(set(variants_to_include.key.collect()))
mt = mt.filter_rows(variants_to_include_set.contains(mt.row_key))
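
If the ~200 variants start out as plain strings rather than a Hail Table, something along these lines might work as a variation on option 2. It is only a sketch: the helper names normalize_variant and filter_to_variants are made up for illustration, the strings are assumed to look like 'contig:position:ref:alt' (the format hl.parse_variant accepts), and the reference genome is assumed to be GRCh37, so pass whatever the VCF was actually imported with.

```python
def normalize_variant(variant: str) -> str:
    """Check that a string looks like 'contig:position:ref:alt' and strip whitespace.

    Hypothetical helper: it only catches malformed input early in Python,
    before the strings are handed to Hail.
    """
    parts = variant.strip().split(":")
    if len(parts) != 4 or not all(parts) or not parts[1].isdigit():
        raise ValueError(f"expected 'contig:position:ref:alt', got {variant!r}")
    return ":".join(parts)


def filter_to_variants(mt, variants, reference_genome="GRCh37"):
    """Keep only the rows of `mt` whose (locus, alleles) key is listed in `variants`.

    Hypothetical helper. Assumes `mt` is keyed by locus and alleles, as a
    MatrixTable from import_vcf is, and that `reference_genome` matches it.
    """
    # Imported here so normalize_variant can be used without Hail installed.
    import hail as hl

    cleaned = sorted({normalize_variant(v) for v in variants})
    # Parse each string into a struct{locus, alleles}, the same type as
    # mt.row_key, so that contains() compares like with like.
    keys = hl.set(
        hl.literal(cleaned, "array<str>").map(
            lambda s: hl.parse_variant(s, reference_genome=reference_genome)
        )
    )
    return mt.filter_rows(keys.contains(mt.row_key))
```

Usage would then be something like mt = filter_to_variants(hl.import_vcf(...), ['1:12345:A:T', '2:67890:G:C']), where the two variant strings are placeholders. The elements of the set have to be the same struct type as mt.row_key; a set of bare loci, or of structs holding only a locus, will not type-check against it.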

Thanks, Tim. It sounds like option 2 will be the fastest. I’ll try that.

A post was split to a new topic: I get an error using contains: set element type: ‘struct{locus: locus}’ type of arg ‘item’: ‘locus’