Accessing fields in structure type


#1

Hi,
I am attempting to filter the gnomAD exomes VDS, keeping only loss-of-function variants. My attempts so far have brought me to the following:
vds.filter_intervals(Interval.parse('22')).filter_variants_expr('va.vep.transcript_consequences.lof == None', keep=False).count_variants()

I get:
Error summary: HailException: `Array[Struct{allele_num:Int,amino_acids:String,biotype:String,canonical:Int,ccds:String,cdna_start:Int,cdna_end:Int,cds_end:Int,cds_start:Int,codons:String,consequence_terms:Array[String],distance:Int,domains:Array[Struct{db:String,name:String}],exon:String,gene_id:String,gene_pheno:Int,gene_symbol:String,gene_symbol_source:String,hgnc_id:Int,hgvsc:String,hgvsp:String,hgvs_offset:Int,impact:String,intron:String,lof:String,lof_flags:String,lof_filter:String,lof_info:String,minimised:Int,polyphen_prediction:String,polyphen_score:Double,protein_end:Int,protein_start:Int,protein_id:String,sift_prediction:String,sift_score:Double,strand:Int,swissprot:String,transcript_id:String,trembl:String,uniparc:String,variant_allele:String}]' has no field or method `lof'

lof is contained within the struct ‘transcript_consequences’.
How can I access that lof field and just keep data without ‘None’ in that field?

Thanks


#2

This is a bit more complicated because it’s not just a structure, it’s an Array[Struct{...}].

If you just want to select the one canonical transcript per variant, there is some discussion of that in the thread "Parsing VEP output".

If you wanted to remove variants where lof was None for every transcript, then here’s the code to do that:

print(vds.filter_intervals(Interval.parse('22'))
    .filter_variants_expr('va.vep.transcript_consequences.forall(tc => isMissing(tc.lof))', keep=False)
    .count_variants())
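
To make the intended semantics concrete, here is a minimal plain-Python sketch (not Hail code) of the filter described above: drop a variant only when lof is missing for every one of its transcripts. The variant records and the helper name all_lof_missing are made up for illustration, and None stands in for a missing value.

```python
# Plain-Python model, not Hail: each variant carries a list of transcript dicts.
# The records below are invented for illustration only.
variants = [
    {"id": "22:100:A:T", "transcript_consequences": [{"lof": None}, {"lof": "HC"}]},
    {"id": "22:200:G:C", "transcript_consequences": [{"lof": None}, {"lof": None}]},
    {"id": "22:300:C:T", "transcript_consequences": [{"lof": "LC"}]},
]


def all_lof_missing(variant):
    """Model of forall(tc => isMissing(tc.lof)): True if every transcript lacks lof."""
    return all(tc["lof"] is None for tc in variant["transcript_consequences"])


# Drop the variants for which the predicate is True; keep the rest.
kept = [v["id"] for v in variants if not all_lof_missing(v)]
print(kept)
```

The second variant is the only one removed, because it is the only one with no lof annotation on any transcript.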

You might want to see the distribution of values for all transcripts:

print(vds.filter_intervals(Interval.parse('22'))
.query_variants('variants.flatMap(v => va.vep.transcript_consequences.map(tc => tc.lof)).counter()'))

#3

Thank you, the latter is what I was looking for. I am now trying to export this information to a TSV, but I run into the same problem of not being able to refer directly to the lof field. The line below works but results in a tonne of extraneous info.
vds_result.export_variants('test.tsv', 'v, va.vep.transcript_consequences.*')
Is there a way of restricting the annotation output to just the lof field?


#4

You could map the array to just its lof values, join them with commas, and export the result:

vds_result.export_variants('test.tsv', 
    'v, va.vep.transcript_consequences.map(tc => tc.lof).mkString(",")')
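
For illustration, the map-then-join step looks roughly like this in plain Python (not Hail code). The records and the helper lof_column are hypothetical, and writing missing values as "NA" is a choice made for this sketch, not a statement about how Hail renders missing elements.

```python
# Invented example records; None stands in for a missing lof value.
variants = [
    {"id": "22:100:A:T", "transcript_consequences": [{"lof": None}, {"lof": "HC"}]},
    {"id": "22:300:C:T", "transcript_consequences": [{"lof": "LC"}]},
]


def lof_column(variant, sep=",", missing="NA"):
    """Collapse the per-transcript lof values into one delimited string."""
    return sep.join(
        missing if tc["lof"] is None else tc["lof"]
        for tc in variant["transcript_consequences"]
    )


# One tab-separated row per variant: the variant id, then the joined lof values.
rows = ["\t".join([v["id"], lof_column(v)]) for v in variants]
for row in rows:
    print(row)
```

Each variant still produces exactly one row; the per-transcript values are packed into a single column rather than expanded into one column per struct field.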