Cannot export to Elasticsearch database, no error log visible

I used the following code to export data to the Elasticsearch database:

import hail as hl
import time

hl.init()
hl.plot.output_notebook()

mt = hl.read_matrix_table('/data/pilot.mt')
ht = hl.split_multi(mt, left_aligned=True)
ht = hl.variant_qc(ht, name='variant_qc')
ht.describe()
ht = ht.annotate_globals(gencodeVersion="25")
config = "file:///home/steve/Desktop/hail_spark/hail/vep95.json"
ht = hl.vep(ht, config=config)
htb = ht.rows()
# tb3 = htb.head(50).flatten().key_by('locus', 'alleles')
tb3 = htb.flatten().key_by('locus', 'alleles')

start = time.time()
hl.export_elasticsearch(tb3, host='localhost', port=9200, index='var1',
                        index_type='qc', block_size=10, config=None)
end = time.time()
print("Processing time = {}".format(end - start))

When I take only 50 rows, I am able to export them to Elasticsearch,
but when I take the whole table, the export fails.
I used the following VEP configuration to annotate the data:

{"command": [
  "/vep",
  "--format", "vcf",
  "__OUTPUT_FORMAT_FLAG__",
  "--everything",
  "--hgvsg",
  "--allele_number",
  "--no_stats",
  "--cache", "--offline",
  "--minimal",
  "--assembly", "GRCh38",
  "--fasta", "/opt/vep/.vep/homo_sapiens/95_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz",
  "--plugin",     "LoF,loftee_path:/opt/vep/Plugins/,gerp_bigwig:/opt/vep/.vep/gerp_conservation_scores.homo_sapiens.GRCh38.bw,human_ancestor_fa:/opt/vep/.vep/human_ancestor.fa.gz,conservation_file:/opt/vep/.vep/loftee.sql",
  "--dir_plugins", "/opt/vep/Plugins/",
  "-o", "STDOUT"
],
"env": {
   "PERL5LIB": "/vep_data/loftee"
},
"vep_json_schema": "Struct{assembly_name:String,allele_string:String,ancestral:String,colocated_variants:Array[Struct{aa_allele:String,aa_maf:Float64,afr_allele:String,afr_maf:Float64,allele_string:String,amr_allele:String,amr_maf:Float64,clin_sig:Array[String],end:Int32,eas_allele:String,eas_maf:Float64,ea_allele:String,ea_maf:Float64,eur_allele:String,eur_maf:Float64,exac_adj_allele:String,exac_adj_maf:Float64,exac_allele:String,exac_afr_allele:String,exac_afr_maf:Float64,exac_amr_allele:String,exac_amr_maf:Float64,exac_eas_allele:String,exac_eas_maf:Float64,exac_fin_allele:String,exac_fin_maf:Float64,exac_maf:Float64,exac_nfe_allele:String,exac_nfe_maf:Float64,exac_oth_allele:String,exac_oth_maf:Float64,exac_sas_allele:String,exac_sas_maf:Float64,id:String,minor_allele:String,minor_allele_freq:Float64,phenotype_or_disease:Int32,pubmed:Array[Int32],sas_allele:String,sas_maf:Float64,somatic:Int32,start:Int32,strand:Int32}],context:String,end:Int32,id:String,input:String,intergenic_consequences:Array[Struct{allele_num:Int32,consequence_terms:Array[String],impact:String,minimised:Int32,variant_allele:String}],most_severe_consequence:String,motif_feature_consequences:Array[Struct{allele_num:Int32,consequence_terms:Array[String],high_inf_pos:String,impact:String,minimised:Int32,motif_feature_id:String,motif_name:String,motif_pos:Int32,motif_score_change:Float64,strand:Int32,variant_allele:String}],regulatory_feature_consequences:Array[Struct{allele_num:Int32,biotype:String,consequence_terms:Array[String],impact:String,minimised:Int32,regulatory_feature_id:String,variant_allele:String}],seq_region_name:String,start:Int32,strand:Int32,transcript_consequences:Array[Struct{allele_num:Int32,amino_acids:String,appris:String,biotype:String,canonical:Int32,ccds:String,cdna_start:Int32,cdna_end:Int32,cds_end:Int32,cds_start:Int32,codons:String,consequence_terms:Array[String],distance:Int32,domains:Array[Struct{db:String,name:String}],exon:String,gene_id:String,gene_pheno:Int32,
gene_symbol:String,gene_symbol_source:String,hgnc_id:String,hgvsg:String,hgvsc:String,hgvsp:String,hgvs_offset:Int32,impact:String,intron:String,lof:String,lof_flags:String,lof_filter:String,lof_info:String,minimised:Int32,polyphen_prediction:String,polyphen_score:Float64,protein_end:Int32,protein_start:Int32,protein_id:String,sift_prediction:String,sift_score:Float64,strand:Int32,swissprot:String,transcript_id:String,trembl:String,tsl:Int32,uniparc:String,variant_allele:String}],variant_class:String}"

}

When exporting data to Elasticsearch I see a lot of warning logs, but no error log, so I don't know what happened:

2019-12-05 14:43:09 Hail: WARN: struct{aa_allele: str, aa_maf: float64, afr_allele: str, afr_maf: float64, allele_string: str, amr_allele: str, amr_maf: float64, clin_sig: array<str>, end: int32, eas_allele: str, eas_maf: float64, ea_allele: str, ea_maf: float64, eur_allele: str, eur_maf: float64, exac_adj_allele: str, exac_adj_maf: float64, exac_allele: str, exac_afr_allele: str, exac_afr_maf: float64, exac_amr_allele: str, exac_amr_maf: float64, exac_eas_allele: str, exac_eas_maf: float64, exac_fin_allele: str, exac_fin_maf: float64, exac_maf: float64, exac_nfe_allele: str, exac_nfe_maf: float64, exac_oth_allele: str, exac_oth_maf: float64, exac_sas_allele: str, exac_sas_maf: float64, id: str, minor_allele: str, minor_allele_freq: float64, phenotype_or_disease: int32, pubmed: array<int32>, sas_allele: str, sas_maf: float64, somatic: int32, start: int32, strand: int32} has no field seq_region_name at <root>.colocated_variants[element] for value JString(1)
2019-12-05 14:43:09 Hail: WARN: struct{assembly_name: str, allele_string: str, ancestral: str, colocated_variants: array<struct{aa_allele: str, aa_maf: float64, afr_allele: str, afr_maf: float64, allele_string: str, amr_allele: str, amr_maf: float64, clin_sig: array<str>, end: int32, eas_allele: str, eas_maf: float64, ea_allele: str, ea_maf: float64, eur_allele: str, eur_maf: float64, exac_adj_allele: str, exac_adj_maf: float64, exac_allele: str, exac_afr_allele: str, exac_afr_maf: float64, exac_amr_allele: str, exac_amr_maf: float64, exac_eas_allele: str, exac_eas_maf: float64, exac_fin_allele: str, exac_fin_maf: float64, exac_maf: float64, exac_nfe_allele: str, exac_nfe_maf: float64, exac_oth_allele: str, exac_oth_maf: float64, exac_sas_allele: str, exac_sas_maf: float64, id: str, minor_allele: str, minor_allele_freq: float64, phenotype_or_disease: int32, pubmed: array<int32>, sas_allele: str, sas_maf: float64, somatic: int32, start: int32, strand: int32}>, context: str, end: int32, id: str, input: str, intergenic_consequences: array<struct{allele_num: int32, consequence_terms: array<str>, impact: str, minimised: int32, variant_allele: str}>, most_severe_consequence: str, motif_feature_consequences: array<struct{allele_num: int32, consequence_terms: array<str>, high_inf_pos: str, impact: str, minimised: int32, motif_feature_id: str, motif_name: str, motif_pos: int32, motif_score_change: float64, strand: int32, variant_allele: str}>, regulatory_feature_consequences: array<struct{allele_num: int32, biotype: str, consequence_terms: array<str>, impact: str, minimised: int32, regulatory_feature_id: str, variant_allele: str}>, seq_region_name: str, start: int32, strand: int32, transcript_consequences: array<struct{allele_num: int32, amino_acids: str, appris: str, biotype: str, canonical: int32, ccds: str, cdna_start: int32, cdna_end: int32, cds_end: int32, cds_start: int32, codons: str, consequence_terms: array<str>, distance: int32, domains: array<struct{db: str, 
name: str}>, exon: str, gene_id: str, gene_pheno: int32, gene_symbol: str, gene_symbol_source: str, hgnc_id: str, hgvsg: str, hgvsc: str, hgvsp: str, hgvs_offset: int32, impact: str, intron: str, lof: str, lof_flags: str, lof_filter: str, lof_info: str, minimised: int32, polyphen_prediction: str, polyphen_score: float64, protein_end: int32, protein_start: int32, protein_id: str, sift_prediction: str, sift_score: float64, strand: int32, swissprot: str, transcript_id: str, trembl: str, tsl: int32, uniparc: str, variant_allele: str}>, variant_class: str} has no field minimised at <root> for value JInt(1)

I used docker.elastic.co/elasticsearch/elasticsearch-oss:6.5.4 and konradjk/vep95_loftee:0.2 to deploy the Elasticsearch database and the VEP tool.
Spark version: 2.4.0
Hail version: 0.2.26-2dcc3d963867
All components were deployed on my workstation.

Thanks

After commenting out the line hl.plot.output_notebook(),
I can see the following log:

Hail version: 0.2.26-2dcc3d963867
Error summary: HailException: found allele outside of expected range [0, 2]: 3
[Stage 0:>                                                         (0 + 5) / 23]

What is the meaning of the logs above?

Can you post the full stack trace?

The short answer is that some input of yours has a genotype call like 0/3 at a tri-allelic site. With only three alleles, valid allele indices are 0, 1, and 2, so 0/3 doesn't make sense at that site.
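To make the range in the error message concrete, here is a plain-Python sketch (not Hail's actual implementation) of the check that is failing: for an alleles list `[ref, alt1, ..., altN]`, every allele index in a genotype call must lie in `[0, len(alleles) - 1]`. The `check_call` helper and the example alleles are illustrative only.

```python
def check_call(alleles, call):
    """Return True if every allele index in `call` is within range
    for the given alleles list [ref, alt1, ..., altN]."""
    max_index = len(alleles) - 1
    return all(0 <= a <= max_index for a in call)

alleles = ["C", "CT", "CTT"]        # tri-allelic site: valid indices 0, 1, 2
print(check_call(alleles, (0, 2)))  # True: both indices within [0, 2]
print(check_call(alleles, (0, 3)))  # False: 3 is outside [0, 2]
```

A call of 0/3 can only appear at a site that still has at least three alternate alleles, which is why the error points at how the multi-allelic sites were (or were not) split.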

Hi @danking, here is the full stack trace:

Traceback (most recent call last):
  File "/home/steve/Desktop/mash_system/dgv4vn/vcf_parser/hail_module/hail_parser.py", line 18, in <module>
    hl.export_elasticsearch(tb3,host='localhost',port=9200,index='var1',index_type='qc',block_size=10,config=None)
  File "</home/steve/Desktop/mash_system/dgv4vn/vcf_parser/venv/lib/python3.6/site-packages/decorator.py:decorator-gen-1234>", line 2, in export_elasticsearch
  File "/home/steve/Desktop/mash_system/dgv4vn/vcf_parser/venv/lib/python3.6/site-packages/hail/typecheck/check.py", line 585, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/home/steve/Desktop/mash_system/dgv4vn/vcf_parser/venv/lib/python3.6/site-packages/hail/methods/impex.py", line 2273, in export_elasticsearch
    jdf = t.expand_types().to_spark(flatten=False)._jdf
  File "</home/steve/Desktop/mash_system/dgv4vn/vcf_parser/venv/lib/python3.6/site-packages/decorator.py:decorator-gen-948>", line 2, in to_spark
  File "/home/steve/Desktop/mash_system/dgv4vn/vcf_parser/venv/lib/python3.6/site-packages/hail/typecheck/check.py", line 585, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/home/steve/Desktop/mash_system/dgv4vn/vcf_parser/venv/lib/python3.6/site-packages/hail/table.py", line 3058, in to_spark
    return Env.spark_backend('to_spark').to_spark(self, flatten)
  File "/home/steve/Desktop/mash_system/dgv4vn/vcf_parser/venv/lib/python3.6/site-packages/hail/backend/backend.py", line 150, in to_spark
    return pyspark.sql.DataFrame(self._to_java_ir(t._tir).pyToDF(), Env.spark_session()._wrapped)
  File "/home/steve/Desktop/mash_system/dgv4vn/vcf_parser/venv/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/steve/Desktop/mash_system/dgv4vn/vcf_parser/venv/lib/python3.6/site-packages/hail/utils/java.py", line 225, in deco
    'Error summary: %s' % (deepest, full, hail.__version__, deepest)) from None
hail.utils.java.FatalError: HailException: found allele outside of expected range [0, 2]: 3

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 0.0 failed 1 times, most recent failure: Lost task 4.0 in stage 0.0 (TID 4, localhost, executor driver): is.hail.utils.HailException: found allele outside of expected range [0, 2]: 3
    at is.hail.codegen.generated.C31.method_4(Unknown Source)
    at is.hail.codegen.generated.C31.apply(Unknown Source)
    at is.hail.codegen.generated.C31.apply(Unknown Source)
    at is.hail.expr.ir.TableMapRows$$anonfun$54$$anonfun$apply$19.apply(TableIR.scala:1095)
    at is.hail.expr.ir.TableMapRows$$anonfun$54$$anonfun$apply$19.apply(TableIR.scala:1094)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
    at scala.collection.Iterator$$anon$12.next(Iterator.scala:445)
    at is.hail.io.RichContextRDDRegionValue$$anonfun$boundary$extension$1$$anon$1.next(RichContextRDDRegionValue.scala:193)
    at is.hail.io.RichContextRDDRegionValue$$anonfun$boundary$extension$1$$anon$1.next(RichContextRDDRegionValue.scala:177)
    at scala.collection.Iterator$$anon$12.next(Iterator.scala:445)
    at scala.collection.Iterator$$anon$1.next(Iterator.scala:1008)
    at scala.collection.Iterator$$anon$1.head(Iterator.scala:995)
    at is.hail.utils.richUtils.RichIterator$$anon$5.value(RichIterator.scala:21)
    at is.hail.utils.StagingIterator.value(FlipbookIterator.scala:49)
    at is.hail.utils.FlipbookIterator$$anon$9.setValue(FlipbookIterator.scala:331)
    at is.hail.utils.FlipbookIterator$$anon$9.advance(FlipbookIterator.scala:341)
    at is.hail.utils.StagingIterator.advance(FlipbookIterator.scala:53)
    at is.hail.utils.FlipbookIterator$$anon$5.advance(FlipbookIterator.scala:179)
    at is.hail.utils.StagingIterator.stage(FlipbookIterator.scala:61)
    at is.hail.utils.StagingIterator.hasNext(FlipbookIterator.scala:71)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:41)
    at org.apache.spark.RangePartitioner$$anonfun$13.apply(Partitioner.scala:306)
    at org.apache.spark.RangePartitioner$$anonfun$13.apply(Partitioner.scala:304)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1887)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1875)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1874)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1874)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2108)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2057)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2046)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
    at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:309)
    at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:171)
    at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:151)
    at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
    at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
    at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:623)
    at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:624)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.sortBy(RDD.scala:621)
    at is.hail.expr.ir.TableOrderBy.execute(TableIR.scala:1856)
    at is.hail.expr.ir.TableMapRows.execute(TableIR.scala:967)
    at is.hail.expr.ir.TableMapGlobals.execute(TableIR.scala:1123)
    at is.hail.expr.ir.TableMapRows.execute(TableIR.scala:967)
    at is.hail.expr.ir.TableMapGlobals.execute(TableIR.scala:1123)
    at is.hail.expr.ir.Interpret$.apply(Interpret.scala:35)
    at is.hail.expr.ir.Interpret$.apply(Interpret.scala:20)
    at is.hail.expr.ir.TableIR$$anonfun$pyToDF$1.apply(TableIR.scala:84)
    at is.hail.expr.ir.TableIR$$anonfun$pyToDF$1.apply(TableIR.scala:83)
    at is.hail.utils.package$.using(package.scala:596)
    at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:10)
    at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:9)
    at is.hail.utils.package$.using(package.scala:596)
    at is.hail.annotations.Region$.scoped(Region.scala:18)
    at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:9)
    at is.hail.expr.ir.TableIR.pyToDF(TableIR.scala:83)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

is.hail.utils.HailException: found allele outside of expected range [0, 2]: 3
    at is.hail.codegen.generated.C31.method_4(Unknown Source)
    at is.hail.codegen.generated.C31.apply(Unknown Source)
    at is.hail.codegen.generated.C31.apply(Unknown Source)
    at is.hail.expr.ir.TableMapRows$$anonfun$54$$anonfun$apply$19.apply(TableIR.scala:1095)
    at is.hail.expr.ir.TableMapRows$$anonfun$54$$anonfun$apply$19.apply(TableIR.scala:1094)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
    at scala.collection.Iterator$$anon$12.next(Iterator.scala:445)
    at is.hail.io.RichContextRDDRegionValue$$anonfun$boundary$extension$1$$anon$1.next(RichContextRDDRegionValue.scala:193)
    at is.hail.io.RichContextRDDRegionValue$$anonfun$boundary$extension$1$$anon$1.next(RichContextRDDRegionValue.scala:177)
    at scala.collection.Iterator$$anon$12.next(Iterator.scala:445)
    at scala.collection.Iterator$$anon$1.next(Iterator.scala:1008)
    at scala.collection.Iterator$$anon$1.head(Iterator.scala:995)
    at is.hail.utils.richUtils.RichIterator$$anon$5.value(RichIterator.scala:21)
    at is.hail.utils.StagingIterator.value(FlipbookIterator.scala:49)
    at is.hail.utils.FlipbookIterator$$anon$9.setValue(FlipbookIterator.scala:331)
    at is.hail.utils.FlipbookIterator$$anon$9.advance(FlipbookIterator.scala:341)
    at is.hail.utils.StagingIterator.advance(FlipbookIterator.scala:53)
    at is.hail.utils.FlipbookIterator$$anon$5.advance(FlipbookIterator.scala:179)
    at is.hail.utils.StagingIterator.stage(FlipbookIterator.scala:61)
    at is.hail.utils.StagingIterator.hasNext(FlipbookIterator.scala:71)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.util.random.SamplingUtils$.reservoirSampleAndCount(SamplingUtils.scala:41)
    at org.apache.spark.RangePartitioner$$anonfun$13.apply(Partitioner.scala:306)
    at org.apache.spark.RangePartitioner$$anonfun$13.apply(Partitioner.scala:304)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)




Hail version: 0.2.26-2dcc3d963867
Error summary: HailException: found allele outside of expected range [0, 2]: 3
2019-12-06 08:55:01 Hail: WARN: struct{assembly_name: str, allele_string: str, ancestral: str, colocated_variants: array<struct{aa_allele: str, aa_maf: float64, afr_allele: str, afr_maf: float64, allele_string: str, amr_allele: str, amr_maf: float64, clin_sig: array<str>, end: int32, eas_allele: str, eas_maf: float64, ea_allele: str, ea_maf: float64, eur_allele: str, eur_maf: float64, exac_adj_allele: str, exac_adj_maf: float64, exac_allele: str, exac_afr_allele: str, exac_afr_maf: float64, exac_amr_allele: str, exac_amr_maf: float64, exac_eas_allele: str, exac_eas_maf: float64, exac_fin_allele: str, exac_fin_maf: float64, exac_maf: float64, exac_nfe_allele: str, exac_nfe_maf: float64, exac_oth_allele: str, exac_oth_maf: float64, exac_sas_allele: str, exac_sas_maf: float64, id: str, minor_allele: str, minor_allele_freq: float64, phenotype_or_disease: int32, pubmed: array<int32>, sas_allele: str, sas_maf: float64, somatic: int32, start: int32, strand: int32}>, context: str, end: int32, id: str, input: str, intergenic_consequences: array<struct{allele_num: int32, consequence_terms: array<str>, impact: str, minimised: int32, variant_allele: str}>, most_severe_consequence: str, motif_feature_consequences: array<struct{allele_num: int32, consequence_terms: array<str>, high_inf_pos: str, impact: str, minimised: int32, motif_feature_id: str, motif_name: str, motif_pos: int32, motif_score_change: float64, strand: int32, variant_allele: str}>, regulatory_feature_consequences: array<struct{allele_num: int32, biotype: str, consequence_terms: array<str>, impact: str, minimised: int32, regulatory_feature_id: str, variant_allele: str}>, seq_region_name: str, start: int32, strand: int32, transcript_consequences: array<struct{allele_num: int32, amino_acids: str, appris: str, biotype: str, canonical: int32, ccds: str, cdna_start: int32, cdna_end: int32, cds_end: int32, cds_start: int32, codons: str, consequence_terms: array<str>, distance: int32, domains: array<struct{db: str, 
name: str}>, exon: str, gene_id: str, gene_pheno: int32, gene_symbol: str, gene_symbol_source: str, hgnc_id: str, hgvsg: str, hgvsc: str, hgvsp: str, hgvs_offset: int32, impact: str, intron: str, lof: str, lof_flags: str, lof_filter: str, lof_info: str, minimised: int32, polyphen_prediction: str, polyphen_score: float64, protein_end: int32, protein_start: int32, protein_id: str, sift_prediction: str, sift_score: float64, strand: int32, swissprot: str, transcript_id: str, trembl: str, tsl: int32, uniparc: str, variant_allele: str}>, variant_class: str} has no field minimised at <root> for value JInt(1)
log4j:WARN Detected problem with connection: java.net.SocketException: Connection reset

Process finished with exit code 1

This is an example of our data:

chr1    1195500 .       C       CTTTTTTTTTTTTTTTTTTTTT,CTTTTTTTTTTTTTTTTTTTTTT,CTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT,CTTTTTTTTTTT,CTTTTTTTTTTTTTTTTTTTTTTTTTTT,CTTTTTTTTTTTTTTTTTTTTTTTTTT    444.21  .       AC=5,4,1,3,2,1;AF=0.208,0.167,0.042,0.125,0.083,0.042;AN=24;DP=209;FS=9.929;MQ=216.82;MQRankSum=0.118;QD=0.89;ReadPosRankSum=1.060;SOR=0.101    GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP       0/0:17,3,3,3,3,3,3:0.150,0.150,0.150,0.150,0.150,0.150:20:0:PASS:.:.:0,0,472,0,472,472,0,472,472,472,0,472,472,472,472,0,472,472,472,472,472,0,472,472,472,472,472,472:.        1/1:0,8,0,0,0,0,0:1.000,0.000,0.000,0.000,0.000,0.000:8:20:indel:0,4,0,0,0,0,0:0,4,0,0,0,0,0:77,23,0,332,24,78,332,24,78,78,332,24,78,78,78,332,24,78,78,78,78,332,24,78,78,78,78,78:7.025e+01,2.025e+01,4.121e-02,3.596e+02,8.659e+01,1.088e+02,3.596e+02,8.659e+01,1.088e+02,1.088e+02,3.596e+02,8.659e+01,1.088e+02,1.088e+02,1.088e+02,3.596e+02,8.659e+01,1.088e+02,1.088e+02,1.088e+02,1.088e+02,3.596e+02,8.659e+01,1.088e+02,1.088e+02,1.088e+02,1.088e+02,1.088e+02    4/4:0,0,0,0,13,0,0:0.000,0.000,0.000,1.000,0.000,0.000:13:35:indel:0,0,0,0,9,0,0:0,0,0,0,4,0,0:92,584,93,584,93,93,584,93,93,93,38,39,39,39,0,584,93,93,93,39,93,584,93,93,93,39,93,93:8.524e+01,4.500e+02,1.239e+02,4.500e+02,1.239e+02,1.239e+02,4.500e+02,1.239e+02,1.239e+02,1.239e+02,3.524e+01,1.017e+02,1.017e+02,1.017e+02,1.301e-03,4.500e+02,1.239e+02,1.239e+02,1.239e+02,1.017e+02,1.239e+02,4.500e+02,1.239e+02,1.239e+02,1.239e+02,1.017e+02,1.239e+02,1.239e+02  5/5:0,0,0,0,0,10,0:0.000,0.000,0.000,0.000,1.000,0.000:10:28:indel:0,0,0,0,0,3,0:0,0,0,0,0,7,0:84,398,83,398,83,83,398,83,83,83,398,83,83,83,83,31,30,30,30,30,0,398,83,83,83,83,30,83:7.788e+01,4.268e+02,1.149e+02,4.268e+02,1.149e+02,1.149e+02,4.268e+02,1.149e+02,1.149e+02,1.149e+02,4.268e+02,1.149e+02,1.149e+02,1.149e+02,1.149e+02,2.788e+01,9.362e+01,9.362e+01,9.362e+01,9.362e+01,7.092e-03,4.268e+02,1.149e+02,1.149e+02,1.149e+02,1.149e+02,9.362e+01,1.149e+02  
0/0:24,3,3,3,3,3,3:0.111,0.111,0.111,0.111,0.111,0.111:27:8:PASS:.:.:0,8,733,8,733,733,8,733,733,733,8,733,733,733,733,8,733,733,733,733,733,8,733,733,733,733,733,733:.        1/3:0,3,0,4,0,0,0:0.429,0.000,0.571,0.000,0.000,0.000:7:46:indel:0,3,0,3,0,0,0:0,0,0,1,0,0,0:174,128,48,242,116,162,120,0,109,54,242,116,162,109,162,242,116,162,109,162,162,242,116,162,109,162,162,162:1.658e+02,1.242e+02,4.667e+01,2.683e+02,1.774e+02,1.922e+02,1.158e+02,1.155e-04,1.701e+02,5.297e+01,2.683e+02,1.774e+02,1.922e+02,1.701e+02,1.922e+02,2.683e+02,1.774e+02,1.922e+02,1.701e+02,1.922e+02,1.922e+02,2.683e+02,1.774e+02,1.922e+02,1.701e+02,1.922e+02,1.922e+02,1.922e+02        1/1:0,4,0,1,0,0,0:0.800,0.000,0.200,0.000,0.000,0.000:5:5:indel:0,1,0,1,0,0,0:0,3,0,0,0,0,0:76,22,0,149,12,66,140,3,129,57,149,12,66,129,66,149,12,66,129,66,66,149,12,66,129,66,66,66:7.099e+01,2.099e+01,1.596e+00,1.786e+02,7.632e+01,9.855e+01,1.387e+02,5.236e+00,1.930e+02,5.824e+01,1.786e+02,7.632e+01,9.855e+01,1.930e+02,9.855e+01,1.786e+02,7.632e+01,9.855e+01,1.930e+02,9.855e+01,9.855e+01,1.786e+02,7.632e+01,9.855e+01,1.930e+02,9.855e+01,9.855e+01,9.855e+01  0/4:2,0,0,0,3,0,0:0.000,0.000,0.000,0.600,0.000,0.000:5:32:indel:1,0,0,0,3,0,0:1,0,0,0,0,0,0:51,120,93,120,93,93,120,93,93,93,0,39,39,39,30,120,93,93,93,39,93,120,93,93,93,39,93,93:4.691e+01,1.506e+02,1.263e+02,1.506e+02,1.263e+02,1.263e+02,1.506e+02,1.263e+02,1.263e+02,1.263e+02,2.514e-03,1.041e+02,1.041e+02,1.041e+02,3.253e+01,1.506e+02,1.263e+02,1.263e+02,1.263e+02,1.041e+02,1.263e+02,1.506e+02,1.263e+02,1.263e+02,1.263e+02,1.041e+02,1.263e+02,1.263e+02    
0/6:9,0,0,0,0,0,1:0.000,0.000,0.000,0.000,0.000,0.091:11:43:indel:3,0,0,0,0,0,0:6,0,0,0,0,0,1:47,42,92,42,92,92,42,92,92,92,42,92,92,92,92,42,92,92,92,92,92,0,370,370,370,370,370,50:4.298e+01,7.249e+01,1.255e+02,7.249e+01,1.255e+02,1.255e+02,7.249e+01,1.255e+02,1.255e+02,1.255e+02,7.249e+01,1.255e+02,1.255e+02,1.255e+02,1.255e+02,7.249e+01,1.255e+02,1.255e+02,1.255e+02,1.255e+02,1.255e+02,3.322e-02,4.357e+02,4.357e+02,4.357e+02,4.357e+02,4.357e+02,5.303e+01   0/0:13,3,3,3,3,3,3:0.188,0.188,0.188,0.188,0.188,0.188:16:0:PASS:.:.:0,0,212,0,212,212,0,212,212,212,0,212,212,212,212,0,212,212,212,212,212,0,212,212,212,212,212,212:.        2/2:0,0,15,0,0,0,0:0.000,1.000,0.000,0.000,0.000,0.000:15:42:indel:0,0,9,0,0,0,0:0,0,6,0,0,0,0:99,654,99,45,45,0,654,99,45,99,654,99,45,99,99,654,99,45,99,99,99,654,99,45,99,99,99,99:9.152e+01,4.500e+02,1.299e+02,4.152e+01,1.077e+02,3.065e-04,4.500e+02,1.299e+02,1.077e+02,1.299e+02,4.500e+02,1.299e+02,1.077e+02,1.299e+02,1.299e+02,4.500e+02,1.299e+02,1.077e+02,1.299e+02,1.299e+02,1.299e+02,4.500e+02,1.299e+02,1.077e+02,1.299e+02,1.299e+02,1.299e+02,1.299e+02  2/2:0,0,8,0,0,0,0:0.000,1.000,0.000,0.000,0.000,0.000:8:20:indel:0,0,1,0,0,0,0:0,0,7,0,0,0,0:77,359,78,23,24,0,359,78,24,78,359,78,24,78,78,359,78,24,78,78,78,359,78,24,78,78,78,78:7.043e+01,3.871e+02,1.089e+02,2.043e+01,8.666e+01,3.948e-02,3.871e+02,1.089e+02,8.666e+01,1.089e+02,3.871e+02,1.089e+02,8.666e+01,1.089e+02,1.089e+02,3.871e+02,1.089e+02,8.666e+01,1.089e+02,1.089e+02,1.089e+02,3.871e+02,1.089e+02,8.666e+01,1.089e+02,1.089e+02,1.089e+02,1.089e+02

Do you think there are some problems with our data?

Ah, you probably want split_multi_hts. The split_multi method only splits the variant rows into bi-allelic variants; it does not update the genotype fields (which is what lets split_multi also work on tables of variants). If you want all the genotypes updated correctly, you need to use split_multi_hts. I'll move the warning in split_multi up higher so it's more obvious.
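For intuition, here is a plain-Python sketch (illustrative only, not Hail's code) of the genotype downcoding that split_multi_hts performs and split_multi skips: when splitting out alternate allele `a_index`, each allele in the call that equals `a_index` becomes 1, and every other allele is downcoded to 0.

```python
def downcode_call(call, a_index):
    """Downcode a multi-allelic genotype call for one split-out allele:
    alleles equal to a_index map to 1, all others map to 0."""
    return tuple(1 if allele == a_index else 0 for allele in call)

# A 1/3 call at a site with several alternate alleles:
print(downcode_call((1, 3), 1))  # (1, 0): het for the first alt
print(downcode_call((1, 3), 3))  # (0, 1): het for the third alt
print(downcode_call((1, 3), 2))  # (0, 0): hom-ref for the second alt
```

After this downcoding every split row only ever contains allele indices 0 and 1, which is why the "allele outside of expected range" error disappears once split_multi_hts is used.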

Thanks @danking, the previous issue is gone. But when looking at the Elasticsearch database, I see the following data:

  {
    "_index": "var1",
    "_type": "qc",
    "_id": "uOhK3m4BQqcPURZpumre",
    "_score": 1,
    "_source": {
      "locus": {
        "contig": "chr1",
        "position": 10560
      },
      "alleles": [
        "C",
        "G"
      ],
      "qual": 17.28,
      "info.AC": [
        1
      ],
      "info.AF": [
        0.042
      ],
      "info.AN": 24,
      "info.DB": false,
      "info.DP": 142,
      "info.FS": 0,
      "info.MQ": 73.62,
      "info.MQRankSum": -0.922,
      "info.QD": 2.47,
      "info.ReadPosRankSum": 0.198,
      "info.SOR": 1.802,
      "info.NML": false,
      "info.FGT": false,
      "a_index": 1,
      "was_split": false,
      "variant_qc.dp_stats.mean": 11.833333333333334,
      "variant_qc.dp_stats.stdev": 5.941847823325295,
      "variant_qc.dp_stats.min": 4,
      "variant_qc.dp_stats.max": 27,
      "variant_qc.gq_stats.mean": 31.416666666666668,
      "variant_qc.gq_stats.stdev": 17.988229176016436,
      "variant_qc.gq_stats.min": 0,
      "variant_qc.gq_stats.max": 75,
      "variant_qc.AC": [
        23,
        1
      ],
      "variant_qc.AF": [
        0.9583333333333334,
        0.041666666666666664
      ],
      "variant_qc.AN": 24,
      "variant_qc.homozygote_count": [
        11,
        0
      ],
      "variant_qc.call_rate": 1,
      "variant_qc.n_called": 12,
      "variant_qc.n_not_called": 0,
      "variant_qc.n_filtered": 0,
      "variant_qc.n_het": 1,
      "variant_qc.n_non_ref": 1,
      "variant_qc.het_freq_hwe": 0.08333333333333333,
      "variant_qc.p_value_hwe": 0.5,
      "vep.assembly_name": "GRCh38",
      "vep.allele_string": "C/G",
      "vep.colocated_variants": [
        {
          "allele_string": "C/G",
          "end": 10560,
          "id": "rs1379919052",
          "start": 10560,
          "strand": 1
        }
      ],
      "vep.end": 10560,
      "vep.id": ".",
      "vep.input": "chr1\t10560\t.\tC\tG\t.\t.\tGT",
      "vep.most_severe_consequence": "upstream_gene_variant",
      "vep.seq_region_name": "chr1",
      "vep.start": 10560,
      "vep.strand": 1,
      "vep.transcript_consequences": [
        {
          "allele_num": 1,
          "biotype": "transcribed_unprocessed_pseudogene",
          "consequence_terms": [
            "upstream_gene_variant"
          ],
          "distance": 1450,
          "gene_id": "ENSG00000223972",
          "gene_symbol": "DDX11L1",
          "gene_symbol_source": "HGNC",
          "hgnc_id": "HGNC:37102",
          "hgvsg": "chr1:g.10560C>G",
          "impact": "MODIFIER",
          "strand": 1,
          "transcript_id": "ENST00000450305",
          "variant_allele": "G"
        },
        {
          "allele_num": 1,
          "biotype": "processed_transcript",
          "canonical": 1,
          "consequence_terms": [
            "upstream_gene_variant"
          ],
          "distance": 1309,
          "gene_id": "ENSG00000223972",
          "gene_symbol": "DDX11L1",
          "gene_symbol_source": "HGNC",
          "hgnc_id": "HGNC:37102",
          "hgvsg": "chr1:g.10560C>G",
          "impact": "MODIFIER",
          "strand": 1,
          "transcript_id": "ENST00000456328",
          "tsl": 1,
          "variant_allele": "G"
        },
        {
          "allele_num": 1,
          "biotype": "unprocessed_pseudogene",
          "canonical": 1,
          "consequence_terms": [
            "downstream_gene_variant"
          ],
          "distance": 3844,
          "gene_id": "ENSG00000227232",
          "gene_symbol": "WASH7P",
          "gene_symbol_source": "HGNC",
          "hgnc_id": "HGNC:38034",
          "hgvsg": "chr1:g.10560C>G",
          "impact": "MODIFIER",
          "strand": -1,
          "transcript_id": "ENST00000488147",
          "variant_allele": "G"
        }
      ],
      "vep.variant_class": "SNV"
    }
  },

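As an aside, the dotted keys in that document (info.AC, variant_qc.AC, vep.start, …) come from the flatten() call in the export code, which turns nested structs into top-level fields with dot-joined names. A plain-Python sketch of that naming scheme (the function name is mine, not Hail’s):

```python
def flatten(struct, prefix=""):
    """Recursively flatten nested dicts into a single dict whose keys are
    dot-joined paths, mimicking how Table.flatten() names fields
    (e.g. {'info': {'AC': [1]}} -> {'info.AC': [1]})."""
    out = {}
    for key, value in struct.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, name))
        else:
            out[name] = value
    return out
```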
It doesn’t have all of the fields that were defined in the configuration file above:

"vep_json_schema": "Struct{assembly_name:String,allele_string:String,ancestral:String,colocated_variants:Array[Struct{aa_allele:String,aa_maf:Float64,afr_allele:String,afr_maf:Float64,allele_string:String,amr_allele:String,amr_maf:Float64,clin_sig:Array[String],end:Int32,eas_allele:String,eas_maf:Float64,ea_allele:String,ea_maf:Float64,eur_allele:String,eur_maf:Float64,exac_adj_allele:String,exac_adj_maf:Float64,exac_allele:String,exac_afr_allele:String,exac_afr_maf:Float64,exac_amr_allele:String,exac_amr_maf:Float64,exac_eas_allele:String,exac_eas_maf:Float64,exac_fin_allele:String,exac_fin_maf:Float64,exac_maf:Float64,exac_nfe_allele:String,exac_nfe_maf:Float64,exac_oth_allele:String,exac_oth_maf:Float64,exac_sas_allele:String,exac_sas_maf:Float64,id:String,minor_allele:String,minor_allele_freq:Float64,phenotype_or_disease:Int32,pubmed:Array[Int32],sas_allele:String,sas_maf:Float64,somatic:Int32,start:Int32,strand:Int32}],context:String,end:Int32,id:String,input:String,intergenic_consequences:Array[Struct{allele_num:Int32,consequence_terms:Array[String],impact:String,minimised:Int32,variant_allele:String}],most_severe_consequence:String,motif_feature_consequences:Array[Struct{allele_num:Int32,consequence_terms:Array[String],high_inf_pos:String,impact:String,minimised:Int32,motif_feature_id:String,motif_name:String,motif_pos:Int32,motif_score_change:Float64,strand:Int32,variant_allele:String}],regulatory_feature_consequences:Array[Struct{allele_num:Int32,biotype:String,consequence_terms:Array[String],impact:String,minimised:Int32,regulatory_feature_id:String,variant_allele:String}],seq_region_name:String,start:Int32,strand:Int32,transcript_consequences:Array[Struct{allele_num:Int32,amino_acids:String,appris:String,biotype:String,canonical:Int32,ccds:String,cdna_start:Int32,cdna_end:Int32,cds_end:Int32,cds_start:Int32,codons:String,consequence_terms:Array[String],distance:Int32,domains:Array[Struct{db:String,name:String}],exon:String,gene_id:String,gene_pheno:Int32,
gene_symbol:String,gene_symbol_source:String,hgnc_id:String,hgvsg:String,hgvsc:String,hgvsp:String,hgvs_offset:Int32,impact:String,intron:String,lof:String,lof_flags:String,lof_filter:String,lof_info:String,minimised:Int32,polyphen_prediction:String,polyphen_score:Float64,protein_end:Int32,protein_start:Int32,protein_id:String,sift_prediction:String,sift_score:Float64,strand:Int32,swissprot:String,transcript_id:String,trembl:String,tsl:Int32,uniparc:String,variant_allele:String}],variant_class:String}"

}

And Hail still prints a lot of warning logs when I export data to the Elasticsearch database.

The warnings mean that those fields were missing from the VEP output. If those fields are important to you, you’ll need to understand why VEP is not producing them. If those fields are not important, then you need not worry.

The fields that are missing are probably absent from your table. For example, I suspect this:

# t: your VEP-annotated table (tb3 in the code above)
t.filter(
    t.vep.colocated_variants.any(lambda x: hl.is_defined(x.aa_maf))
).count()

will return 0.


Thanks @danking