Subsample vcf based on a file of rsids


I have a “keep” file of rsIDs that I want to retain in my VCF files. What is the best way to do this?
rsidTable = hl.import_table(ids)  # this is my keep list
vcfs = hl.hadoop_ls(vcfloc)
for loc in vcfs:  # VCFs are split by chromosome
    name = loc['path'].split("/")[-1]
    vcf = hl.import_vcf(name, reference_genome=None)

Eventually, I’d like to consolidate the VCFs from each chromosome into a single file. Any help on any part of this is much appreciated, but right now I’m mainly stuck on how to subset the VCF by rsIDs.


Are your VCF files different shards of the same callset (same samples, different variants)? If so, you can import them all at once by passing a list of paths:

mt = hl.import_vcf([loc['path'] for loc in vcfs])  # import_vcf accepts a list of paths; hadoop_ls returns dicts

Do the rsIDs in rsidTable map onto the ID field of the VCF (Hail imports it as the row field rsid)? If so, you can do the following:

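rsidTable = rsidTable.key_by('rsid')  # assumes an 'rsid' column; key_by returns a new table, so reassign
mt = mt.filter_rows(hl.is_defined(rsidTable[mt.rsid]))  # keep rows whose rsid appears in the keep list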
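
Putting the pieces together, here is a rough end-to-end sketch that also covers the "single file" part of the question. It is only a sketch under a few assumptions: the function names vcf_paths and subset_and_export and the out_path argument are made up for illustration, the keep file is assumed to have a header column named rsid, and reference_genome is left at Hail's default on the assumption that the contig names match it (adjust if they do not).

```python
def vcf_paths(listing):
    """Return sorted file paths from an hl.hadoop_ls-style listing.

    Each entry is expected to be a dict with a 'path' key; entries
    flagged as directories are skipped.
    """
    return sorted(
        entry['path'] for entry in listing
        if not entry.get('is_dir', False)
    )


def subset_and_export(vcfloc, ids, out_path):
    """Hypothetical helper: filter sharded VCFs to a keep list of rsIDs."""
    # Imported here so vcf_paths stays usable without Hail installed.
    import hail as hl

    # Keep list, keyed by the (assumed) 'rsid' column so it can be joined on.
    keep = hl.import_table(ids).key_by('rsid')

    # Import every per-chromosome shard as one MatrixTable.
    mt = hl.import_vcf(vcf_paths(hl.hadoop_ls(vcfloc)))

    # Keep only rows whose rsid is present in the keep list.
    mt = mt.filter_rows(hl.is_defined(keep[mt.rsid]))

    # Write the filtered variants out as a single VCF.
    hl.export_vcf(mt, out_path)
```

Because the list import already yields one MatrixTable covering all chromosomes, no separate merge step should be needed before exporting.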