Subsample vcf based on a file of rsids

apoursh · October 30, 2018, 11:41pm

Hi,

I have a “keep” file of rsids that I want to retain in my VCF files. What is the best way to do this?
rsidTable = hl.import_table(ids)  # This is my keep list
vcfs = hl.hadoop_ls(vcfloc)
for loc in vcfs:  # VCFs are split by chromosome
    name = loc['path'].split("/")[-1]
    vcf = hl.import_vcf(name, reference_genome=None)

Eventually, I’d like to consolidate the VCFs from each chromosome into a single file. Any help with any part of that is much appreciated, but right now I’m mainly stuck on how to subset the VCF by rsids.

Thanks

tpoterba · October 31, 2018, 1:01am

Are your VCF files different shards of the same callset (same samples, different variants)? If so, you can import them as a list of paths:

mt = hl.import_vcf([f['path'] for f in vcfs])  # import_vcf accepts a list of paths
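
Note that hl.hadoop_ls returns a list of dicts (with keys such as 'path' and 'is_dir') rather than plain path strings, so the paths have to be extracted before importing. A minimal sketch of that step follows; the helper name vcf_paths_from_listing, the accepted suffixes, and the example bucket paths are illustrative assumptions, not part of Hail.

```python
def vcf_paths_from_listing(listing, suffixes=(".vcf", ".vcf.bgz", ".vcf.gz")):
    """Return the sorted paths of VCF files from hl.hadoop_ls-style entries.

    `listing` is a list of dicts that each have a 'path' key and,
    optionally, an 'is_dir' key. Directories and non-VCF files are skipped.
    """
    return sorted(
        entry["path"]
        for entry in listing
        if not entry.get("is_dir", False) and entry["path"].endswith(suffixes)
    )


# Hand-written listing for illustration only (hypothetical paths).
listing = [
    {"path": "gs://my-bucket/vcfs/chr2.vcf.bgz", "is_dir": False},
    {"path": "gs://my-bucket/vcfs/chr1.vcf.bgz", "is_dir": False},
    {"path": "gs://my-bucket/vcfs/logs", "is_dir": True},
    {"path": "gs://my-bucket/vcfs/README.txt", "is_dir": False},
]
paths = vcf_paths_from_listing(listing)
```

The resulting list of strings can then be passed directly to hl.import_vcf. Files ending in .vcf.gz may additionally need force_bgz=True if they are actually block-gzipped.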

Do the rsids in the rsidTable map onto the ID field of the VCF? If so, you can do the following:

rsidTable = rsidTable.key_by('rsid')  # key_by returns a new table; assumes a field named 'rsid'
mt = mt.filter_rows(hl.is_defined(rsidTable[mt.rsid]))  # keep rows whose ID is in the keep list
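
Putting the pieces together, here is a rough end-to-end sketch that also covers the "consolidate into a single file" part of the question. It assumes the keep file has a header line with a column named 'rsid', that all shards share the same samples, and that the VCFs match Hail's default reference genome (hl.export_vcf needs a proper locus, so reference_genome=None is not used here). The function name subset_and_export and its parameters are hypothetical.

```python
def subset_and_export(vcf_paths, keep_file, out_path, rsid_field="rsid"):
    """Filter sharded VCFs to the rsids in keep_file and write one VCF.

    vcf_paths:  list of VCF path strings (shards of the same callset)
    keep_file:  delimited text file with a header containing rsid_field
    out_path:   destination, e.g. ending in .vcf.bgz
    """
    # Imported inside the function so this definition loads without Hail.
    import hail as hl

    # key_by returns a new table, so the result must be captured.
    keep = hl.import_table(keep_file).key_by(rsid_field)

    # A list of paths is imported as a single MatrixTable.
    mt = hl.import_vcf(vcf_paths)

    # Rows whose rsid has no match in the keep table are dropped.
    mt = mt.filter_rows(hl.is_defined(keep[mt.rsid]))

    # Writes all retained variants, across chromosomes, to one file.
    hl.export_vcf(mt, out_path)
```

For a short keep list, collecting the ids into a Python set and filtering with hl.literal(the_set).contains(mt.rsid) is another option that avoids the join.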
