We are using hail 0.2.57. We checked the output from Vep 99 it annotated 79 transcript_consequences for this variant (4-113358472-T-C, hg38), but when we checked the mt.vep.transcript_consequences, the field is empty. Other variants look normal. Is there any limit how many transcript_consequences Hail can handle? It’s an array, so I don’t see any reason why it was discarded? Any idea would be appreciated!
Is the entire VEP field missing, or just transcript consequences? Sometimes VEP generates invalid JSON that we cannot parse. There’s a known case involving genes with names “nan” or something that triggers this.
Is there any limit how many transcript_consequences Hail can handle?
Hail currently has an array size limit of java MAX_INT (2 ** 31 - 1), but I don’t think that’s the problem here.
Yes, only transcript consequences for this particular variant is missing. Maybe some variants have the same issue that we didn’t discover, but most look normal for protein coding variants.
Yes, only transcript consequences for this particular variant is missing
Sorry, I wasn’t clear. Is the entire mt.vep
struct missing at the problematic variant, or do you see other fields in that struct, but transcript_consequences
alone is missing?
Sorry, maybe I just didn’t get your question. mt.vep struct looks good, all the field and most of data are there as it should except transcript_consequences value is empty for this particular variant which it shouldn’t. It’s ‘NA’ in the value. The only special thing we are seeing in the vep output is that it has many transcripts.
Is it possible to load the VEP output (JSON format) into a matrix table with the definition vep_json_schema from a function directly? I looked at import_matrix_table function, it doesn’t seem to provide that capability.
This is actually possible, though the types look a little different. Can you post your vep_json_schema here?
Great, thanks! Please see below
"vep_json_schema": "Struct{assembly_name:String,allele_string:String,ancestral:String,colocated_variants:Array[Struct{aa_allele:String,aa_maf:Float64,afr_allele:String,afr_maf:Float64,allele_string:String,amr_allele:String,amr_maf:Float64,clin_sig:Array[String],end:Int32,eas_allele:String,eas_maf:Float64,ea_allele:String,ea_maf:Float64,eur_allele:String,eur_maf:Float64,exac_adj_allele:String,exac_adj_maf:Float64,exac_allele:String,exac_afr_allele:String,exac_afr_maf:Float64,exac_amr_allele:String,exac_amr_maf:Float64,exac_eas_allele:String,exac_eas_maf:Float64,exac_fin_allele:String,exac_fin_maf:Float64,exac_maf:Float64,exac_nfe_allele:String,exac_nfe_maf:Float64,exac_oth_allele:String,exac_oth_maf:Float64,exac_sas_allele:String,exac_sas_maf:Float64,id:String,minor_allele:String,minor_allele_freq:Float64,phenotype_or_disease:Int32,pubmed:Array[Int32],sas_allele:String,sas_maf:Float64,somatic:Int32,start:Int32,strand:Int32}],context:String,end:Int32,id:String,input:String,intergenic_consequences:Array[Struct{allele_num:Int32,consequence_terms:Array[String],impact:String,minimised:Int32,variant_allele:String}],most_severe_consequence:String,motif_feature_consequences:Array[Struct{allele_num:Int32,consequence_terms:Array[String],high_inf_pos:String,impact:String,minimised:Int32,motif_feature_id:String,motif_name:String,motif_pos:Int32,motif_score_change:Float64,strand:Int32,variant_allele:String}],regulatory_feature_consequences:Array[Struct{allele_num:Int32,biotype:String,consequence_terms:Array[String],impact:String,minimised:Int32,regulatory_feature_id:String,variant_allele:String}],seq_region_name:String,start:Int32,strand:Int32,transcript_consequences:Array[Struct{existing_inframe_oorfs:Int32,existing_outofframe_oorfs:Int32,existing_uorfs:Int32,five_prime_utr_variant_consequence:String,five_prime_utr_variant_annotation:String,allele_num:Int32,amino_acids:String,appris:String,biotype:String,canonical:Int32,ccds:String,cdna_start:Int32,cdna_end:Int32,cds_end:Int32,cds_start:Int32,codons:String,consequence_terms:Array[String],distance:Int32,domains:Array[Struct{db:String,name:String}],exon:String,gene_id:String,gene_pheno:Int32,gene_symbol:String,gene_symbol_source:String,hgnc_id:String,hgvsc:String,hgvsp:String,hgvs_offset:Int32,impact:String,intron:String,lof:String,lof_flags:String,lof_filter:String,lof_info:String,minimised:Int32,polyphen_prediction:String,polyphen_score:Float64,protein_end:Int32,protein_start:Int32,protein_id:String,sift_prediction:String,sift_score:Float64,strand:Int32,swissprot:String,transcript_id:String,trembl:String,tsl:Int32,uniparc:String,variant_allele:String}],variant_class:String}"
I also uploaded the output of the VEP for the variant in case that’s helpful.
4-113358472-T-C.txt (43.4 KB)
Is there anything we can try on our end? Thanks!
sorry, haven’t gotten to this yet. I’m trying to dig up a script I once had to convert the old-style VEP config types to something that can be parsed as a new-style Hail dtype in parse_json
I uploaded the subvet VCF as well. We upgraded the Hail to a latest version 0.2.70. The field value transcript_consequences is stilling missing for chr4:113358472 T>C. I attached another normal one as well. We are running on VEP 99. Could you help us to look at it? This is very important to us. Thanks!
subset.txt (309 KB)
Sorry for the delay! Lots to do at the moment. Here’s a function I used to convert that type into a hl.dtype
compatible one:
def reformat_dtype(s):
s = s.replace('Array', 'array')
s = s.replace('Set', 'set')
s = s.replace('Dict', 'dict')
s = s.replace('Struct', 'struct')
s = s.replace('Tuple', 'tuple')
s = s.replace('Int32', 'int32')
s = s.replace('Int64', 'int64')
s = s.replace('Float32', 'float32')
s = s.replace('Float64', 'float64')
s = s.replace('String', 'str')
s = s.replace('Binary', 'binary')
s = s.replace('Locus', 'locus')
s = s.replace('[', '<')
s = s.replace(']', '>')
return s
We can parse json using the hl.parse_json
function. I’m running
t = hl.dtype(reformat_dtype(TYPE_STR))
vep_output = open('that txt file', 'r').read()
hl.eval(hl.parse_json(vep_output, t)).transcript_consequences
This is giving me output. I’m super confused now.
One thing I am not sure if hl.parse_json and hl.vep call the same parser ultimately? I don’t quite get it when I looked at the codes. If not, these two results are not comparable? I didn’t look very deep though.
This is the method that gets called by hl.parse_json
:
Were you able to reproduce our issue with hl.vep? I just wanted to confirm if we are on the same page. Thanks!
No, it’s not easy for me to install VEP with your config. There might be a couple other ways to debug, though – I can make a debug build that prints/logs VEP output and Hail values as they come in.
Another request, though - can you run this variant through hl.vep and upload the Hail log file here?
Thanks for helping this! We run the hail on the AWS cluster. It’s our in-house version of a pipeline forked from a repo developed by Broad.
I uploaded the logs for both version 0.2.57 and 0.2.70. In 0.2.57 we did print out the vep on the fly. In 0.2.70, we didn’t print out, but we inspect the mt file it generated afterward. The transcript consequences value are missing for this variant chr4:113358472 T>C. Hope this helps!
hail_0257_log.txt (20.7 KB)
hail_0270_log.txt (22.2 KB)
Sorry, these aren’t the logs I mean – I’m talking about the .log
file that Hail writes during each session. The default location is the current working directory, but it can be parameterized by an argument to hl.init(log=...)