Filter all variants which belong to a sample ID

toandd · December 12, 2019, 10:49am

Hello Hail Dev team,

CHR 	POS 	REF 	ALT 	INFO 				FORMAT 		SAMPLE1 	SAMPLE2 	... 			SAMPLEn
chr1	1100	C	T	AF=0.3,GQ=20,...		GT,AF,DP...	0/1,0.5,...	0/1,0.5,...				0/1,0.5,...
chr1	1101	G	T	AF=...,GQ=...			GT,AF,DP...	0/1,0.5,...	0/1,0.5,...				0/1,0.5,...
chr1	1102	A	T	AF=...,GQ=...			GT,AF,DP...	0/1,0.5,...	0/1,0.5,...				0/1,0.5,...
chr1	1103	C	G	AF=...,GQ=...			GT,AF,DP...	0/1,0.5,...	0/1,0.5,...				0/1,0.5,...
chr1	1104	C	T	AF=...,GQ=...			GT,AF,DP...	0/1,0.5,...	0/1,0.5,...				0/1,0.5,...
chr1	1105	C	T	AF=...,GQ=...			GT,AF,DP...	0/1,0.5,...	0/1,0.5,...				0/1,0.5,...
chr1	1106	C	T	AF=...,GQ=...			GT,AF,DP...	0/1,0.5,...	0/1,0.5,...				0/1,0.5,...
...

I want to collect all variants of a specific sample into a structured record and export it to an Elasticsearch database. The data would look like the example below:

{
	"sample_ID": "SAMPLE1",
	"AF": 0.5,
	"GQ": 10,
	...
	"variant_filter": [
		{
			"locus": {
			    "contig": "chr1",
			    "position": 1100
			  },
			  "alleles": [
			    "C",
			    "T"
			  ],
			"variant_class": "SNV",
			"consequences": ["intron_variant", ...],
			"population_allele_freq": 0.3,
			"population_genotype_quality": 20,
		},
		{
			"locus": {
			    "contig": "chr2",
			    "position": 1101
			  },
			  "alleles": [
			    "G",
			    "T"
			  ],
			"variant_class": "indel",
			"consequences": ["downstream_gene_variant", ...],
			"population_allele_freq": 0.125,
			"population_genotype_quality": 50,
		},
		...	

	]
}

Does Hail support an easy way to do this?

toandd · December 12, 2019, 10:52am

I can find all subjects that carry a specific variant by using the following lines:

    subject_id_list = hl.agg.filter(mt.GT.is_het(), hl.agg.collect(mt.s))
    mt = mt.annotate_rows(carrier=subject_id_list)

But I cannot find a similar way to collect all the variants of a subject.

tpoterba · December 12, 2019, 12:42pm

You can do the same thing with annotate_cols. Don’t do this, though – you’ll blow memory limits.

It somewhat looks like you want the data transposed (sample-major) going into elasticsearch. Is that right?

toandd · December 12, 2019, 2:36pm

I still cannot figure out how to filter all variants of a sample before annotate_cols. Could you suggest a few lines of code? I will test the memory consumption after that.

tpoterba · December 12, 2019, 6:42pm

    variant_list = hl.agg.filter(mt.GT.is_het(), hl.agg.collect(mt.row_key))
    mt = mt.annotate_cols(variants = variant_list)
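
For intuition, the column aggregation above can be modelled in plain Python roughly as follows. This is a toy sketch and not Hail code: `samples`, `rows`, `is_het` and `collect_het_row_keys` are made-up names, and the genotypes are invented. It shows why the memory warning applies: every sample ends up holding a list of all of its heterozygous row keys.

```python
# Toy stand-in for a MatrixTable: one entry per variant (row),
# with one genotype call per sample (column). Illustrative data only.
samples = ["SAMPLE1", "SAMPLE2"]
rows = [
    (("chr1", 1100, ("C", "T")), ["0/1", "0/0"]),
    (("chr1", 1101, ("G", "T")), ["1/1", "0/1"]),
    (("chr1", 1102, ("A", "T")), ["0/1", "0/1"]),
]

def is_het(gt):
    """True for a diploid call with two different, non-missing alleles."""
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        return False
    return alleles[0] != alleles[1]

def collect_het_row_keys(samples, rows):
    """For each sample, collect the row keys where that sample is het."""
    variants = {s: [] for s in samples}
    for row_key, calls in rows:
        for s, gt in zip(samples, calls):
            if is_het(gt):
                variants[s].append(row_key)
    return variants

variants = collect_het_row_keys(samples, rows)
```

With real data the per-sample lists grow with the number of variants, so the collected result scales with the size of the whole matrix.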

toandd · December 13, 2019, 2:16am

Thank you @tpoterba. As I understand it, this code requires Spark to load every line of the VCF file, so it will consume a lot of memory? Can Hail transpose the data when importing a VCF file, or do I have to write a tool to do this before importing the data into Hail?

danking · December 13, 2019, 2:48pm

We currently do not have a transpose operation.
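
If the transposition has to happen outside Hail, one possible shape for such a tool is a streaming pass over the VCF text that keeps only the calls of a single sample. The sketch below is a minimal illustration under several assumptions: a standard tab-separated VCF with a `#CHROM` header line and colon-separated FORMAT fields, diploid calls, and heterozygous calls as the filter (mirroring `is_het()` above). The function name `het_variants_for_sample` is hypothetical; the `sample_ID` and `variant_filter` keys follow the example document in the question.

```python
import json

def het_variants_for_sample(vcf_lines, sample_id):
    """Lazily yield one dict per variant where `sample_id` has a het call."""
    col = None
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            col = fields.index(sample_id)  # ValueError if the sample is absent
            continue
        if col is None:
            raise ValueError("no #CHROM header line before the data lines")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        # Pair the FORMAT keys with this sample's colon-separated values.
        call = dict(zip(fields[8].split(":"), fields[col].split(":")))
        gt = call.get("GT", "./.").replace("|", "/").split("/")
        if len(gt) == 2 and "." not in gt and gt[0] != gt[1]:
            yield {
                "locus": {"contig": chrom, "position": pos},
                "alleles": [ref] + alt.split(","),
            }

# Invented two-variant VCF, for illustration only.
vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\tSAMPLE2\n",
    "chr1\t1100\t.\tC\tT\t.\tPASS\tAF=0.3\tGT:DP\t0/1:12\t0/0:9\n",
    "chr1\t1101\t.\tG\tT\t.\tPASS\tAF=0.1\tGT:DP\t1/1:15\t0/1:7\n",
]
doc = {
    "sample_ID": "SAMPLE1",
    "variant_filter": list(het_variants_for_sample(vcf, "SAMPLE1")),
}
print(json.dumps(doc))
```

Because the function is a generator, it reads one line at a time; only the final `list(...)` holds one sample's variants in memory. Passing an open file handle instead of the `vcf` list should work the same way for an uncompressed VCF.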
