Transcript_consequences value is missing in matrix table

SimonLi5601 · June 22, 2021, 9:22pm

We are using hail 0.2.57. We checked the output from Vep 99 it annotated 79 transcript_consequences for this variant (4-113358472-T-C, hg38), but when we checked the mt.vep.transcript_consequences, the field is empty. Other variants look normal. Is there any limit how many transcript_consequences Hail can handle? It’s an array, so I don’t see any reason why it was discarded? Any idea would be appreciated!

tpoterba · June 22, 2021, 9:51pm

Is the entire VEP field missing, or just transcript consequences? Sometimes VEP generates invalid JSON that we cannot parse. There’s a known case involving genes with names “nan” or something that triggers this.

Is there any limit how many transcript_consequences Hail can handle?

Hail currently has an array size limit of java MAX_INT (2 ** 31 - 1), but I don’t think that’s the problem here.

SimonLi5601 · June 23, 2021, 1:53pm

Yes, only transcript consequences for this particular variant is missing. Maybe some variants have the same issue that we didn’t discover, but most look normal for protein coding variants.

tpoterba · June 23, 2021, 1:55pm

Yes, only transcript consequences for this particular variant is missing

Sorry, I wasn’t clear. Is the entire mt.vep struct missing at the problematic variant, or do you see other fields in that struct, but transcript_consequences alone is missing?

SimonLi5601 · June 23, 2021, 2:38pm

Sorry, maybe I just didn’t get your question. mt.vep struct looks good, all the field and most of data are there as it should except transcript_consequences value is empty for this particular variant which it shouldn’t. It’s ‘NA’ in the value. The only special thing we are seeing in the vep output is that it has many transcripts.

SimonLi5601 · June 23, 2021, 7:07pm

Is it possible to load the VEP output (JSON format) into a matrix table with the definition vep_json_schema from a function directly? I looked at import_matrix_table function, it doesn’t seem to provide that capability.

tpoterba · June 23, 2021, 7:27pm

This is actually possible, though the types look a little different. Can you post your vep_json_schema here?

SimonLi5601 · June 23, 2021, 7:41pm

Great, thanks! Please see below

"vep_json_schema": "Struct{assembly_name:String,allele_string:String,ancestral:String,colocated_variants:Array[Struct{aa_allele:String,aa_maf:Float64,afr_allele:String,afr_maf:Float64,allele_string:String,amr_allele:String,amr_maf:Float64,clin_sig:Array[String],end:Int32,eas_allele:String,eas_maf:Float64,ea_allele:String,ea_maf:Float64,eur_allele:String,eur_maf:Float64,exac_adj_allele:String,exac_adj_maf:Float64,exac_allele:String,exac_afr_allele:String,exac_afr_maf:Float64,exac_amr_allele:String,exac_amr_maf:Float64,exac_eas_allele:String,exac_eas_maf:Float64,exac_fin_allele:String,exac_fin_maf:Float64,exac_maf:Float64,exac_nfe_allele:String,exac_nfe_maf:Float64,exac_oth_allele:String,exac_oth_maf:Float64,exac_sas_allele:String,exac_sas_maf:Float64,id:String,minor_allele:String,minor_allele_freq:Float64,phenotype_or_disease:Int32,pubmed:Array[Int32],sas_allele:String,sas_maf:Float64,somatic:Int32,start:Int32,strand:Int32}],context:String,end:Int32,id:String,input:String,intergenic_consequences:Array[Struct{allele_num:Int32,consequence_terms:Array[String],impact:String,minimised:Int32,variant_allele:String}],most_severe_consequence:String,motif_feature_consequences:Array[Struct{allele_num:Int32,consequence_terms:Array[String],high_inf_pos:String,impact:String,minimised:Int32,motif_feature_id:String,motif_name:String,motif_pos:Int32,motif_score_change:Float64,strand:Int32,variant_allele:String}],regulatory_feature_consequences:Array[Struct{allele_num:Int32,biotype:String,consequence_terms:Array[String],impact:String,minimised:Int32,regulatory_feature_id:String,variant_allele:String}],seq_region_name:String,start:Int32,strand:Int32,transcript_consequences:Array[Struct{existing_inframe_oorfs:Int32,existing_outofframe_oorfs:Int32,existing_uorfs:Int32,five_prime_utr_variant_consequence:String,five_prime_utr_variant_annotation:String,allele_num:Int32,amino_acids:String,appris:String,biotype:String,canonical:Int32,ccds:String,cdna_start:Int32,cdna_end:Int32,cds_end:Int32,cds_start:Int32,codons:String,consequence_terms:Array[String],distance:Int32,domains:Array[Struct{db:String,name:String}],exon:String,gene_id:String,gene_pheno:Int32,gene_symbol:String,gene_symbol_source:String,hgnc_id:String,hgvsc:String,hgvsp:String,hgvs_offset:Int32,impact:String,intron:String,lof:String,lof_flags:String,lof_filter:String,lof_info:String,minimised:Int32,polyphen_prediction:String,polyphen_score:Float64,protein_end:Int32,protein_start:Int32,protein_id:String,sift_prediction:String,sift_score:Float64,strand:Int32,swissprot:String,transcript_id:String,trembl:String,tsl:Int32,uniparc:String,variant_allele:String}],variant_class:String}"

SimonLi5601 · June 23, 2021, 7:55pm

I also uploaded the output of the VEP for the variant in case that’s helpful.

4-113358472-T-C.txt (43.4 KB)

SimonLi5601 · June 24, 2021, 6:37pm

Is there anything we can try on our end? Thanks!

tpoterba · June 24, 2021, 6:48pm

sorry, haven’t gotten to this yet. I’m trying to dig up a script I once had to convert the old-style VEP config types to something that can be parsed as a new-style Hail dtype in parse_json

SimonLi5601 · June 28, 2021, 8:50pm

I uploaded the subvet VCF as well. We upgraded the Hail to a latest version 0.2.70. The field value transcript_consequences is stilling missing for chr4:113358472 T>C. I attached another normal one as well. We are running on VEP 99. Could you help us to look at it? This is very important to us. Thanks!
subset.txt (309 KB)

tpoterba · June 29, 2021, 1:54am

Sorry for the delay! Lots to do at the moment. Here’s a function I used to convert that type into a hl.dtype compatible one:

def reformat_dtype(s):
    s = s.replace('Array', 'array')
    s = s.replace('Set', 'set')
    s = s.replace('Dict', 'dict')
    s = s.replace('Struct', 'struct')
    s = s.replace('Tuple', 'tuple')
    s = s.replace('Int32', 'int32')
    s = s.replace('Int64', 'int64')
    s = s.replace('Float32', 'float32')
    s = s.replace('Float64', 'float64')
    s = s.replace('String', 'str')
    s = s.replace('Binary', 'binary')
    s = s.replace('Locus', 'locus')
    s = s.replace('[', '<')
    s = s.replace(']', '>')
    return s

We can parse json using the hl.parse_json function. I’m running

t = hl.dtype(reformat_dtype(TYPE_STR))
vep_output = open('that txt file', 'r').read()
hl.eval(hl.parse_json(vep_output, t)).transcript_consequences

This is giving me output. I’m super confused now.

SimonLi5601 · June 29, 2021, 1:16pm

One thing I am not sure if hl.parse_json and hl.vep call the same parser ultimately? I don’t quite get it when I looked at the codes. If not, these two results are not comparable? I didn’t look very deep though.

github.com

hail-is/hail/blob/1a861505c1fc2ea3c9d7b32a47be7af10d13907c/hail/src/main/scala/is/hail/methods/VEP.scala#L203

    
      
                    warn(s"No CSQ INFO field for VEP output variant ${ VariantMethods.locusAllelesToString(vepLocus, vepAlleles) }.\nVEP output: $s.")
                    null
                }
                Some((Annotation(locus, alleles), a))
              case None =>
                fatal(s"VEP output variant ${ VariantMethods.locusAllelesToString(vepLocus, vepAlleles) } not found in original variants.\nVEP output: $s")
            }
          } else {
            try {
              val jv = JsonMethods.parse(s)
              val a = JSONAnnotationImpex.importAnnotation(jv, localVepSignature, warnContext = warnContext)
              val variantString = inputQuery(a).asInstanceOf[String]
              if (variantString == null)
                fatal(s"VEP generated null variant string" +
                  s"\n  json:   $s" +
                  s"\n  parsed: $a")
              val vepv@(vepLocus, vepAlleles) = variantFromInput(variantString)
          
          
    nonStarToOriginalVariant.get(vepv) match {
                case Some(v@(locus, alleles)) =>
                  Some((Annotation(locus, alleles), a))

tpoterba · June 29, 2021, 1:21pm

This is the method that gets called by hl.parse_json:

github.com

hail-is/hail/blob/c14bce455d2a66fe368726b110799b703c8b5fe3/hail/src/main/scala/is/hail/expr/AnnotationImpex.scala#L143-L151

    
      
          def irImportAnnotation(s: String, t: Type, warnContext: mutable.HashSet[String]): Row = {
            try {
              // wraps in a Row to handle returned missingness
              Row(importAnnotation(JsonMethods.parse(s), t, true, warnContext))
            } catch {
              case e: Throwable =>
                fatal(s"Error parsing JSON:\n  type: $t\n  value: $s", e)
            }
          }

SimonLi5601 · June 29, 2021, 1:37pm

Were you able to reproduce our issue with hl.vep? I just wanted to confirm if we are on the same page. Thanks!

tpoterba · June 29, 2021, 4:54pm

No, it’s not easy for me to install VEP with your config. There might be a couple other ways to debug, though – I can make a debug build that prints/logs VEP output and Hail values as they come in.

tpoterba · June 29, 2021, 4:59pm

Another request, though - can you run this variant through hl.vep and upload the Hail log file here?

SimonLi5601 · June 29, 2021, 6:39pm

Thanks for helping this! We run the hail on the AWS cluster. It’s our in-house version of a pipeline forked from a repo developed by Broad.

I uploaded the logs for both version 0.2.57 and 0.2.70. In 0.2.57 we did print out the vep on the fly. In 0.2.70, we didn’t print out, but we inspect the mt file it generated afterward. The transcript consequences value are missing for this variant chr4:113358472 T>C. Hope this helps!

hail_0257_log.txt (20.7 KB)
hail_0270_log.txt (22.2 KB)

tpoterba · June 30, 2021, 12:52pm

Sorry, these aren’t the logs I mean – I’m talking about the .log file that Hail writes during each session. The default location is the current working directory, but it can be parameterized by an argument to hl.init(log=...)

Topic		Replies	Views
UTRannotator in hail's VEP is not shown in transcript_consequences field Hail Query & hailctl	9	898	October 21, 2020
VEP command abruptly stops running Hail Query & hailctl	9	613	June 25, 2020
How to parse CSQ (VEP) field inside Hail 0.2? Hail Query & hailctl	17	2009	February 14, 2023
VEP: add field hgvsg Hail Query & hailctl	2	668	November 18, 2019
VEP fails to generate annotations Help [0.1]	17	2152	June 19, 2018

Transcript_consequences value is missing in matrix table

Related topics