Subsample vcf based on a file of rsids


#1

Hi,

I have a “keep” file of rsIDs that I want to retain in my VCF files. What is the best way to do this?
rsidTable = hl.import_table(ids) # This is my keep list
vcfs = hl.hadoop_ls(vcfloc)
for loc in vcfs: # vcfs are split by chrom
    name = loc['path'].split("/")[-1]
    vcf = hl.import_vcf(name, reference_genome=None)

Eventually, I’d like to consolidate the VCFs from each chromosome into a single file. Any help on any part of that is much appreciated, but right now I’m mainly stuck on how to subset the VCF by rsIDs.

Thanks


#2

Are your VCF files different shards of the same callset (same samples, different variants)? If so, you can import them all at once by passing a list of paths:

mt = hl.import_vcf([f['path'] for f in vcfs]) # import_vcf accepts a list of paths; vcfs is the hadoop_ls listing

Do the rsIDs in rsidTable map onto the ID field of the VCF? (import_vcf stores that column in the rsid row field.) If so, you can do the following:

rsidTable = rsidTable.key_by('rsid') # assumes rsid field; key_by returns a new table, so reassign
mt = mt.filter_rows(hl.is_defined(rsidTable[mt.rsid]))
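
Putting the pieces together, here is a rough end-to-end sketch rather than a definitive recipe. The function names vcf_paths and subset_by_rsid are made up for illustration, and the sketch assumes the keep file is tab-delimited with a header column named rsid, that every shard has the same samples, and that GRCh37 is the right reference genome (swap in whichever matches your data). Files ending in .vcf.gz may additionally need force_bgz=True passed to import_vcf.

```python
def vcf_paths(listing):
    """Pick VCF file paths out of an hl.hadoop_ls-style listing (a list of dicts)."""
    suffixes = ('.vcf', '.vcf.gz', '.vcf.bgz')
    return [
        entry['path']
        for entry in listing
        if not entry.get('is_dir', False) and entry['path'].endswith(suffixes)
    ]


def subset_by_rsid(vcf_dir, keep_path, out_path, reference_genome='GRCh37'):
    """Keep only variants whose rsid is in the keep list, then write one VCF."""
    import hail as hl  # imported here so vcf_paths can be used without Hail

    # Assumption: the keep file is tab-delimited with a header column 'rsid'.
    keep = hl.import_table(keep_path, key='rsid')

    # One MatrixTable over all per-chromosome shards (same samples in each).
    paths = vcf_paths(hl.hadoop_ls(vcf_dir))
    mt = hl.import_vcf(paths, reference_genome=reference_genome)

    # A row survives only if its rsid has a match in the keyed keep table.
    mt = mt.filter_rows(hl.is_defined(keep[mt.rsid]))

    # With default arguments, export_vcf writes a single output file.
    hl.export_vcf(mt, out_path)
```

With the variables from the original post, the call would look something like subset_by_rsid(vcfloc, ids, 'filtered.vcf.bgz'), where the output path is just a placeholder. Because all shards are imported into one MatrixTable, the per-chromosome loop and the later consolidation step both go away.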