Filter all variants which belong to a sample ID

Hello Hail Dev team,

CHR 	POS 	REF 	ALT 	INFO 				FORMAT 		SAMPLE1 	SAMPLE2 	... 			SAMPLEn
chr1	1100	C	T	AF=0.3,GQ=20,...		GT,AF,DP...	0/1,0.5,...	0/1,0.5,...				0/1,0.5,...
chr1	1101	G	T	AF=...,GQ=...			GT,AF,DP...	0/1,0.5,...	0/1,0.5,...				0/1,0.5,...
chr1	1102	A	T	AF=...,GQ=...			GT,AF,DP...	0/1,0.5,...	0/1,0.5,...				0/1,0.5,...
chr1	1103	C	G	AF=...,GQ=...			GT,AF,DP...	0/1,0.5,...	0/1,0.5,...				0/1,0.5,...
chr1	1104	C	T	AF=...,GQ=...			GT,AF,DP...	0/1,0.5,...	0/1,0.5,...				0/1,0.5,...
chr1	1105	C	T	AF=...,GQ=...			GT,AF,DP...	0/1,0.5,...	0/1,0.5,...				0/1,0.5,...
chr1	1106	C	T	AF=...,GQ=...			GT,AF,DP...	0/1,0.5,...	0/1,0.5,...				0/1,0.5,...
...

I want to collect all variants carried by a specific sample into a structured document and export it to an Elasticsearch database. The data should look like the example below:

{
	"sample_ID": "SAMPLE1",
	"AF": 0.5,
	"GQ": 10,
	...
	"variant_filter": [
		{
			"locus": {
			    "contig": "chr1",
			    "position": 1100
			  },
			  "alleles": [
			    "C",
			    "T"
			  ],
			"variant_class": "SNV",
			"consequences": ["intron_variant", ...],
			"population_allele_freq": 0.3,
			"population_genotype_quality": 20,
		},
		{
			"locus": {
			    "contig": "chr2",
			    "position": 1101
			  },
			  "alleles": [
			    "G",
			    "T"
			  ],
			"variant_class": "indel",
			"consequences": ["downstream_gene_variant", ...],
			"population_allele_freq": 0.125,
			"population_genotype_quality": 50,
		},
		...	

	]
}

Does Hail support an easy way to do this?

I can collect all subjects that carry a specific variant using the following lines:

    subject_id_list = hl.agg.filter(mt.GT.is_het(), hl.agg.collect(mt.s))
    mt = mt.annotate_rows(carrier=subject_id_list)

But I cannot find a similar way to collect all variants of a subject.

You can do the same thing with annotate_cols. Don’t do this, though – you’ll blow memory limits.

It somewhat looks like you want the data transposed (sample-major) going into elasticsearch. Is that right?

I still cannot figure out a way to filter all variants of a sample before annotate_cols. Could you suggest a few lines of code? I will test memory consumption after that.

    variant_list = hl.agg.filter(mt.GT.is_het(), hl.agg.collect(mt.row_key))
    mt = mt.annotate_cols(variants = variant_list)

Thank you @tpoterba. As I understand it, this code requires Spark to load every line of the VCF file and so consumes a lot of memory? Can Hail transpose the data when importing a VCF file, or do I have to write a tool to do this before importing the data into Hail?

We currently do not have a transpose operation.
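
If you do end up writing a small tool for the transpose, here is a rough sketch in plain Python (no Hail involved) of what that could look like. It assumes a standard VCF text layout: tab-separated columns, a `#CHROM` header line, and colon-separated FORMAT fields with a `GT` key. The function name `transpose_vcf` and the toy input lines are made up for illustration, and the output only mirrors the `sample_ID` / `variant_filter` / `locus` / `alleles` part of the document shown earlier in the thread.

```python
import json
from collections import defaultdict


def transpose_vcf(lines):
    """Group non-reference calls by sample from an iterable of VCF text lines.

    Hypothetical helper: returns one dict per sample, in header order.
    """
    samples = []
    per_sample = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample IDs follow the 9 fixed columns
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        gt_index = fields[8].split(":").index("GT")
        for sample, call in zip(samples, fields[9:]):
            gt = call.split(":")[gt_index]
            called = gt.replace("|", "/").split("/")
            # keep the variant if the sample carries at least one ALT allele
            if any(a not in ("0", ".") for a in called):
                per_sample[sample].append({
                    "locus": {"contig": chrom, "position": int(pos)},
                    "alleles": [ref] + alt.split(","),
                })
    return [{"sample_ID": s, "variant_filter": per_sample[s]} for s in samples]


# Toy input, invented for this example only.
demo = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\tSAMPLE2",
    "chr1\t1100\t.\tC\tT\t.\tPASS\tAF=0.3\tGT:DP\t0/1:12\t0/0:9",
    "chr1\t1101\t.\tG\tT\t.\tPASS\tAF=0.1\tGT:DP\t0/0:8\t1/1:15",
]
docs = transpose_vcf(demo)
print(json.dumps(docs[0], indent=2))
```

Note that this sketch still keeps every sample's variant list in memory, so it hits the same limit on a large cohort. A version that appends each record to a per-sample file as it streams through the VCF would likely scale better.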