Problems with Elasticsearch

I am running the version 1 example for hail_scripts/utils/elasticsearch_client.py.

My environment is a local spark cluster with:
spark-2.0.2-bin-hadoop2.7
Hail 0.1
Elasticsearch 6.4.0
I got two errors:

  1. Which version of Elasticsearch should I use?
    Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'

  2. Why do I get an error in summary = vds.summarize()?
    pipelines/hail_scripts/load_clinvar_to_es_pipeline.py", line 96, in <module>

    summary = vds.summarize()

File "<decorator-gen-442>", line 2, in summarize

File "/nethome/juarezespinosoh/hail/spark/hail_repo_point_one/hail/build/distributions/hail-python.zip/hail/java.py", line 121, in handle_py4j

hail.java.FatalError: NumberFormatException: For input string: "NW_003315947.1"

Any advice on the Elasticsearch version and on getting vds.summarize() to work correctly?
Thanks,
Octavio

I'm not sure about Elasticsearch; I'll have to let someone else answer that.

For the summarize error, can you post the full stack trace? This looks like something is trying to parse an identifier string as an integer.

Hi Tim:
Here is the trace…

File "/nethome/juarezespinosoh/pipelines_hail/hail-elasticsearch-pipelines/hail_scripts/load_clinvar_to_es_pipeline.py", line 96, in <module>

summary = vds.summarize()

File "<decorator-gen-442>", line 2, in summarize

File "/nethome/juarezespinosoh/hail/spark/hail_repo_point_one/hail/build/distributions/hail-python.zip/hail/java.py", line 121, in handle_py4j

hail.java.FatalError: NumberFormatException: For input string: "NW_003315947.1"

Java stack trace:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 15 in stage 3.0 failed 1 times, most recent failure: Lost task 15.0 in stage 3.0 (TID 63, localhost): java.lang.NumberFormatException: For input string: "NW_003315947.1"

   at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)

   at java.lang.Integer.parseInt(Integer.java:580)

   at java.lang.Integer.parseInt(Integer.java:615)

   at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)

   at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)

   at is.hail.expr.FunctionRegistry$$anonfun$312.apply(FunctionRegistry.scala:2510)

   at is.hail.expr.FunctionRegistry$$anonfun$312.apply(FunctionRegistry.scala:2510)

   at is.hail.expr.UnaryFun.apply(Fun.scala:121)

   at is.hail.codegen.generated.C9.apply(Unknown Source)

   at is.hail.codegen.generated.C9.apply(Unknown Source)

   at is.hail.expr.CM$$anonfun$runWithDelayedValues$1.apply(CM.scala:82)

   at is.hail.expr.CM$$anonfun$runWithDelayedValues$1.apply(CM.scala:80)

   at is.hail.expr.Parser$$anonfun$is$hail$expr$Parser$$evalNoTypeCheck$1.apply(Parser.scala:53)

   at is.hail.expr.Parser$$anonfun$parseNamedExprs$2$$anonfun$apply$2.apply$mcV$sp(Parser.scala:228)

   at is.hail.expr.Parser$$anonfun$parseNamedExprs$3$$anonfun$apply$15.apply(Parser.scala:239)

   at is.hail.expr.Parser$$anonfun$parseNamedExprs$3$$anonfun$apply$15.apply(Parser.scala:239)

   at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)

   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)

   at is.hail.expr.Parser$$anonfun$parseNamedExprs$3.apply(Parser.scala:239)

   at is.hail.expr.Parser$$anonfun$parseNamedExprs$3.apply(Parser.scala:238)

   at is.hail.expr.Parser$$anonfun$parseAnnotationExprs$4.apply(Parser.scala:125)

   at is.hail.expr.Parser$$anonfun$parseAnnotationExprs$4.apply(Parser.scala:124)

   at is.hail.variant.VariantSampleMatrix$$anonfun$57.apply(VariantSampleMatrix.scala:892)

   at is.hail.variant.VariantSampleMatrix$$anonfun$57.apply(VariantSampleMatrix.scala:888)

   at is.hail.variant.VariantSampleMatrix$$anonfun$mapAnnotations$1.apply(VariantSampleMatrix.scala:1620)

   at is.hail.variant.VariantSampleMatrix$$anonfun$mapAnnotations$1.apply(VariantSampleMatrix.scala:1620)

   at is.hail.utils.richUtils.RichPairRDD$$anonfun$mapValuesWithKey$extension$1$$anonfun$apply$1.apply(RichPairRDD.scala:18)

   at is.hail.utils.richUtils.RichPairRDD$$anonfun$mapValuesWithKey$extension$1$$anonfun$apply$1.apply(RichPairRDD.scala:18)

   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:202)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:195)

   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:202)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:195)

   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:202)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:195)

   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:202)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:195)

   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:202)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:195)

   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:202)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:195)

   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:202)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:195)

   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:202)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:195)

   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:202)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:195)

   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:202)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:195)

   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:202)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:195)

   at scala.collection.Iterator$class.foreach(Iterator.scala:893)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.foreach(OrderedRDD.scala:195)

   at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.foldLeft(OrderedRDD.scala:195)

   at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:214)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.aggregate(OrderedRDD.scala:195)

   at org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1089)

   at org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1089)

   at org.apache.spark.SparkContext$$anonfun$32.apply(SparkContext.scala:1935)

   at org.apache.spark.SparkContext$$anonfun$32.apply(SparkContext.scala:1935)

   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)

   at org.apache.spark.scheduler.Task.run(Task.scala:86)

   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)

   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

   at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:

   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)

   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)

   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)

   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)

   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)

   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)

   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)

   at scala.Option.foreach(Option.scala:257)

   at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)

   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667)

   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)

   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)

   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)

   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1873)

   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1936)

   at org.apache.spark.rdd.RDD$$anonfun$aggregate$1.apply(RDD.scala:1091)

   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)

   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)

   at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)

   at org.apache.spark.rdd.RDD.aggregate(RDD.scala:1084)

   at is.hail.variant.VariantDatasetFunctions$.summarize$extension(VariantDataset.scala:220)

   at is.hail.variant.VariantDatasetFunctions.summarize(VariantDataset.scala:221)

   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

   at java.lang.reflect.Method.invoke(Method.java:498)

   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)

   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)

   at py4j.Gateway.invoke(Gateway.java:280)

   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)

   at py4j.commands.CallCommand.execute(CallCommand.java:79)

   at py4j.GatewayConnection.run(GatewayConnection.java:214)

   at java.lang.Thread.run(Thread.java:748)java.lang.NumberFormatException: For input string: "NW_003315947.1"

   at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)

   at java.lang.Integer.parseInt(Integer.java:580)

   at java.lang.Integer.parseInt(Integer.java:615)

   at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)

   at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)

   at is.hail.expr.FunctionRegistry$$anonfun$312.apply(FunctionRegistry.scala:2510)

   at is.hail.expr.FunctionRegistry$$anonfun$312.apply(FunctionRegistry.scala:2510)

   at is.hail.expr.UnaryFun.apply(Fun.scala:121)

   at is.hail.codegen.generated.C9.apply(Unknown Source)

   at is.hail.codegen.generated.C9.apply(Unknown Source)

   at is.hail.expr.CM$$anonfun$runWithDelayedValues$1.apply(CM.scala:82)

   at is.hail.expr.CM$$anonfun$runWithDelayedValues$1.apply(CM.scala:80)

   at is.hail.expr.Parser$$anonfun$is$hail$expr$Parser$$evalNoTypeCheck$1.apply(Parser.scala:53)

   at is.hail.expr.Parser$$anonfun$parseNamedExprs$2$$anonfun$apply$2.apply$mcV$sp(Parser.scala:228)

   at is.hail.expr.Parser$$anonfun$parseNamedExprs$3$$anonfun$apply$15.apply(Parser.scala:239)

   at is.hail.expr.Parser$$anonfun$parseNamedExprs$3$$anonfun$apply$15.apply(Parser.scala:239)

   at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)

   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)

   at is.hail.expr.Parser$$anonfun$parseNamedExprs$3.apply(Parser.scala:239)

   at is.hail.expr.Parser$$anonfun$parseNamedExprs$3.apply(Parser.scala:238)

   at is.hail.expr.Parser$$anonfun$parseAnnotationExprs$4.apply(Parser.scala:125)

   at is.hail.expr.Parser$$anonfun$parseAnnotationExprs$4.apply(Parser.scala:124)

   at is.hail.variant.VariantSampleMatrix$$anonfun$57.apply(VariantSampleMatrix.scala:892)

   at is.hail.variant.VariantSampleMatrix$$anonfun$57.apply(VariantSampleMatrix.scala:888)

   at is.hail.variant.VariantSampleMatrix$$anonfun$mapAnnotations$1.apply(VariantSampleMatrix.scala:1620)

   at is.hail.variant.VariantSampleMatrix$$anonfun$mapAnnotations$1.apply(VariantSampleMatrix.scala:1620)

   at is.hail.utils.richUtils.RichPairRDD$$anonfun$mapValuesWithKey$extension$1$$anonfun$apply$1.apply(RichPairRDD.scala:18)

   at is.hail.utils.richUtils.RichPairRDD$$anonfun$mapValuesWithKey$extension$1$$anonfun$apply$1.apply(RichPairRDD.scala:18)

   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:202)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:195)

   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:202)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:195)

   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:202)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:195)

   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:202)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:195)

   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:202)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:195)

   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:202)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:195)

   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:202)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:195)

   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:202)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:195)

   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:202)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:195)

   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:202)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:195)

   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:202)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.next(OrderedRDD.scala:195)

   at scala.collection.Iterator$class.foreach(Iterator.scala:893)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.foreach(OrderedRDD.scala:195)

   at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.foldLeft(OrderedRDD.scala:195)

   at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:214)

   at is.hail.sparkextras.OrderedRDD$$anonfun$apply$8$$anon$2.aggregate(OrderedRDD.scala:195)

   at org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1089)

   at org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1089)

   at org.apache.spark.SparkContext$$anonfun$32.apply(SparkContext.scala:1935)

   at org.apache.spark.SparkContext$$anonfun$32.apply(SparkContext.scala:1935)

   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)

   at org.apache.spark.scheduler.Task.run(Task.scala:86)

   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)

   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

   at java.lang.Thread.run(Thread.java:748)

Hail version: 0.1-74bf1eb

Error summary: NumberFormatException: For input string: "NW_003315947.1"

Could you please recommend a good procedure for installing the pipelines on Google Cloud or Amazon's cloud?

The error I got in vds.summarize() was produced by running:
/nethome/juarezespinosoh/hail/spark/spark-2.0.2-bin-hadoop2.7/bin/spark-submit --master spark://ai-grisnodedev1:7077 --jars /nethome/juarezespinosoh/hail/spark/hail_repo_point_one/hail/build/libs/hail-all-spark.jar --py-files /nethome/juarezespinosoh/hail/spark/hail_repo_point_one/hail/build/distributions/hail-python.zip hail_scripts/load_clinvar_to_es_pipeline2.py -g 37 -H localhost -p 9200 -i clinvar -t variant -b 200 -s 1
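As an aside on the first error: elasticsearch-hadoop prints "Cannot detect ES version" when it cannot reach the cluster at all, or when it needs the 'es.nodes.wan.only' setting it mentions. If the WAN/Cloud case turns out to apply, that setting can usually be supplied through Spark configuration; a rough sketch (assuming the connector picks up spark.es.* properties from the Spark config, and that connectivity is really the issue) is to add a --conf flag to the same spark-submit line:

/nethome/juarezespinosoh/hail/spark/spark-2.0.2-bin-hadoop2.7/bin/spark-submit --master spark://ai-grisnodedev1:7077 --conf "spark.es.nodes.wan.only=true" --jars /nethome/juarezespinosoh/hail/spark/hail_repo_point_one/hail/build/libs/hail-all-spark.jar --py-files /nethome/juarezespinosoh/hail/spark/hail_repo_point_one/hail/build/distributions/hail-python.zip hail_scripts/load_clinvar_to_es_pipeline2.py -g 37 -H localhost -p 9200 -i clinvar -t variant -b 200 -s 1

For a single-node Elasticsearch on localhost, though, this message is more often a plain connectivity or port problem than a WAN issue.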

Ah, 0.1. Hmm. Can we get you switched over to 0.2? We're going to fully deprecate 0.1 in a few weeks.

Re: the error, I need to see the full python script to know what’s causing that NumberFormatException.

Re: Google / Amazon cloud, we have a tool for installing and running Hail on Google Dataproc (Google's managed Spark product). This makes it super easy! https://github.com/Nealelab/cloudtools/
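For anyone reading later, the usual workflow with that tool looks roughly like the sketch below. The cluster name and script name are placeholders, and the subcommands are taken from the cloudtools README at the time, so check the repo for the current flags (for example, for choosing a Hail or Spark version):

pip install cloudtools
cluster start my-cluster
cluster submit my-cluster my_hail_script.py
cluster stop my-cluster

It assumes the gcloud SDK is installed and authenticated against a GCP project with Dataproc enabled.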

The reason I am using 0.1 is that it was described as stable, and the pipelines were working with 0.1.
Question: do the pipelines work now with Hail 0.2?
For reference: this code did not work (the script is below). I ran it with this command:
/nethome/juarezespinosoh/hail/spark/spark-2.0.2-bin-hadoop2.7/bin/spark-submit --master spark://ai-grisnodedev1:7077 --jars /nethome/juarezespinosoh/hail/spark/hail_repo_point_one/hail/build/libs/hail-all-spark.jar --py-files /nethome/juarezespinosoh/hail/spark/hail_repo_point_one/hail/build/distributions/hail-python.zip hail_scripts/load_clinvar_to_es_pipeline.py -g 37 -H localhost -p 9200 -i clinvar -t variant -b 200 -s 1

import argparse
import hail
from pprint import pprint
from hail_scripts.utils.add_clinvar import (
    CLINVAR_GOLD_STARS_LOOKUP, CLINVAR_VDS_PATH, download_and_import_latest_clinvar_vcf,
)
from hail_scripts.utils.computed_fields import (
    get_expr_for_contig, get_expr_for_start_pos, get_expr_for_ref_allele,
    get_expr_for_alt_allele, get_expr_for_variant_id, get_expr_for_worst_transcript_consequence_annotations_struct,
    get_expr_for_xpos, get_expr_for_vep_gene_ids_set, get_expr_for_vep_transcript_ids_set,
    get_expr_for_vep_sorted_transcript_consequences_array, get_expr_for_vep_protein_domains_set,
    get_expr_for_vep_consequence_terms_set, get_expr_for_vep_transcript_id_to_consequence_map,
)
from hail_scripts.utils.elasticsearch_client import ElasticsearchClient
from hail_scripts.utils.vds_utils import run_vep

p = argparse.ArgumentParser()
p.add_argument("-g", "--genome-version", help="Genome build: 37 or 38", choices=["37", "38"], required=True)
p.add_argument("-H", "--host", help="Elasticsearch host or IP", required=True)
p.add_argument("-p", "--port", help="Elasticsearch port", default=9200, type=int)
p.add_argument("-i", "--index-name", help="Elasticsearch index name")
p.add_argument("-t", "--index-type", help="Elasticsearch index type", default="variant")
p.add_argument("-s", "--num-shards", help="Number of elasticsearch shards", default=1, type=int)
p.add_argument("-b", "--block-size", help="Elasticsearch block size to use when exporting", default=200, type=int)
p.add_argument("--subset", help="Specify an interval (e.g. X:12345-54321) to load a subset of clinvar")
args = p.parse_args()

client = ElasticsearchClient(
    host=args.host,
    port=args.port,
)

if args.index_name:
    index_name = args.index_name.lower()
else:
    index_name = "clinvar_grch{}".format(args.genome_version)

hc = hail.HailContext(log="hail.log")

# download vcf
vds = download_and_import_latest_clinvar_vcf(hc, args.genome_version)

# run VEP
print("\n=== Running VEP ===")
vds = run_vep(vds, root='va.vep', block_size=1000)

#pprint(vds.variant_schema)

computed_annotation_exprs = [
    #"va.docId = %s" % get_expr_for_variant_id(512),
    "va.chrom = %s" % get_expr_for_contig(),
    "va.pos = %s" % get_expr_for_start_pos(),
    "va.ref = %s" % get_expr_for_ref_allele(),
    "va.alt = %s" % get_expr_for_alt_allele(),
    "va.xpos = %s" % get_expr_for_xpos(pos_field="start"),
    "va.variantId = %s" % get_expr_for_variant_id(),
    "va.domains = %s" % get_expr_for_vep_protein_domains_set(vep_transcript_consequences_root="va.vep.transcript_consequences"),
    "va.sortedTranscriptConsequences = %s" % get_expr_for_vep_sorted_transcript_consequences_array(vep_root="va.vep"),
    "va.transcriptConsequenceTerms = %s" % get_expr_for_vep_consequence_terms_set(vep_transcript_consequences_root="va.sortedTranscriptConsequences"),
    "va.transcriptIds = %s" % get_expr_for_vep_transcript_ids_set(vep_transcript_consequences_root="va.sortedTranscriptConsequences"),
    "va.transcriptIdToConsequenceMap = %s" % get_expr_for_vep_transcript_id_to_consequence_map(vep_transcript_consequences_root="va.sortedTranscriptConsequences"),
    "va.geneIds = %s" % get_expr_for_vep_gene_ids_set(vep_transcript_consequences_root="va.sortedTranscriptConsequences", exclude_upstream_downstream_genes=True),
    "va.mainTranscript = %s" % get_expr_for_worst_transcript_consequence_annotations_struct("va.sortedTranscriptConsequences"),
    #"va.codingGeneIds = %s" % get_expr_for_vep_gene_ids_set(vep_transcript_consequences_root="va.sortedTranscriptConsequences", only_coding_genes=True, exclude_upstream_downstream_genes=True),
    #"va.sortedTranscriptConsequences = json(va.sortedTranscriptConsequences)",
]

for expr in computed_annotation_exprs:
    vds = vds.annotate_variants_expr(expr)

expr = """
va.clean.variant_id = va.variantId,
va.clean.chrom = va.chrom,
va.clean.pos = va.pos,
va.clean.ref = va.ref,
va.clean.xpos = va.xpos,
va.clean.transcript_consequence_terms = va.transcriptConsequenceTerms,
va.clean.domains = va.domains,
va.clean.transcript_ids = va.transcriptIds,
va.clean.gene_ids = va.geneIds,
va.clean.transcript_id_to_consequence_json = va.transcriptIdToConsequenceMap,
va.clean.main_transcript = va.mainTranscript,
va.clean.allele_id = va.info.ALLELEID,
va.clean.clinical_significance = va.info.CLNSIG.toSet.mkString(","),
va.clean.review_status = va.info.CLNREVSTAT.toSet.mkString(","),
va.clean.gold_stars = {}.get(va.info.CLNREVSTAT.toSet.mkString(","))
""".format(CLINVAR_GOLD_STARS_LOOKUP)

vds = vds.annotate_variants_expr(expr=expr)
vds = vds.annotate_variants_expr("va = va.clean")

pprint(vds.variant_schema)

#summary = vds.summarize()
#print("\nSummary:" + str(summary))

print("Export to elasticsearch")
client.export_vds_to_elasticsearch(
    vds,
    index_name=index_name,
    index_type_name=args.index_type,
    block_size=args.block_size,
    num_shards=args.num_shards,
    delete_index_before_exporting=True,
    #elasticsearch_mapping_id="doc_id",
    is_split_vds=True,
    verbose=True,
    export_globals_to_index_meta=True,
)

I will try to install the Google cloud tools…

Ah, if you’re using the Seqr stuff then you’ll have to use 0.1 – this won’t work in 0.2.

Thanks. Did you see any error in the script I sent, the one that produced the errors in summarize()?

I think the problem is somewhere in here. I can’t really tell you where, since I’m not familiar with the Seqr codebase… sorry.

I’ll ask them when I get a chance what their timeline for updating to 0.2 is.
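One detail worth noting from the earlier trace: the failing value "NW_003315947.1" is a non-numeric contig name, and the exception comes from a toInt call in the expression language, so whichever computed annotation converts the contig to a number (most likely the xpos one) chokes on the non-standard contigs in the ClinVar VCF. A rough workaround sketch, assuming Hail 0.1's filter_variants_expr and that dropping non-primary contigs is acceptable for your use case, is to filter the VDS before applying computed_annotation_exprs:

# Keep only primary contigs so expressions that parse the contig as an
# integer do not fail on scaffold names like "NW_003315947.1".
standard_contigs = [str(c) for c in range(1, 23)] + ["X", "Y", "MT"]
contig_filter_expr = '["%s"].toSet.contains(v.contig)' % '", "'.join(standard_contigs)
vds = vds.filter_variants_expr(contig_filter_expr, keep=True)

Whether xpos is the exact culprit is a guess from the trace, so checking which expression in your version of computed_fields actually calls toInt would confirm it.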

import hail
from pprint import pprint

hc = hail.HailContext()
#/var/lib/spark/vep/vep.properties
vds = hc.import_vcf('/sample.vcf')
vds = vds.vep(config="/nethome/juarezespinosoh/pipelines_hail/hail-elasticsearch-pipelines/vep.properties", root='va.vep', block_size=1000)
vds = (
    hc
    .import_vcf('/sample.vcf')
    .split_multi()
    .annotate_variants_db('va.vep')
    .annotate_variants_expr('va.my_gene = va.vep.transcript_consequences[0].gene_symbol')
    #.annotate_variants_db('va.gene.constraint.pli', gene_key='va.my_gene')
)

pprint(vds.variant_schema)

vds.summarize().report()

Why do I get an error in Hail 0.1 on this line: .annotate_variants_db('va.vep')?

Can I change the access to the Google database?

What's the full error message, including stack trace? I'm going to ask that every time :)

If you’re not actually running on Google Dataproc, you won’t be able to use the Annotation DB. I think that’s the problem.
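For what it's worth, a local-only sketch that skips the Annotation DB and reuses the pieces already in your own script (your vep.properties path, then reading the gene symbol straight out of va.vep) might look like this; treat it as illustrative rather than a drop-in fix:

vds = (
    hc.import_vcf('/sample.vcf')
      .split_multi()
      .vep(config="/nethome/juarezespinosoh/pipelines_hail/hail-elasticsearch-pipelines/vep.properties", root='va.vep', block_size=1000)
      .annotate_variants_expr('va.my_gene = va.vep.transcript_consequences[0].gene_symbol')
)
vds.summarize().report()

Everything in that sketch comes from calls already present in your scripts, so it should at least show whether annotate_variants_db is the only blocker when running off Dataproc.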

Thanks. Here is the stack trace for an error that refers to aIndex. This is from when I am trying to export the VDS to Elasticsearch.
Traceback (most recent call last):
File "/nethome/juarezespinosoh/pipelines_hail/hail-elasticsearch-pipelines/test_hail/example3.py", line 27, in
verbose=True,
File "/nethome/juarezespinosoh/pipelines_hail/hail-elasticsearch-pipelines/hail_scripts/utils/elasticsearch_client.py", line 297, in export_vds_to_elasticsearch
genotype_fields_to_export,
File "", line 2, in make_table
File "/nethome/juarezespinosoh/hail/spark/hail_repo_point_one/hail/build/distributions/hail-python.zip/hail/java.py", line 121, in handle_py4j
hail.java.FatalError: HailException: Struct has no field `aIndex'
Available fields:
rsid: String
qual: Double
filters: Set[String]
info: Struct{NEGATIVE_TRAIN_SITE:Boolean,HWP:Double,AC:Array[Int],culprit:String,MQ0:Int,ReadPosRankSum:Double,AN:Int,InbreedingCoeff:Double,AF:Array[Double],GQ_STDDEV:Double,FS:Double,DP:Int,GQ_MEAN:Double,POSITIVE_TRAIN_SITE:Boolean,VQSLOD:Double,ClippingRankSum:Double,BaseQRankSum:Double,MLEAF:Array[Double],MLEAC:Array[Int],MQ:Double,QD:Double,END:Int,DB:Boolean,HaplotypeScore:Double,MQRankSum:Double,CCC:Int,NCC:Int,DS:Boolean}
:1:filters = va.filters,info_AC = va.info.AC[va.aIndex-1],info_AF = va.info.AF[va.aIndex-1],info_AN = va.info.AN,info_BaseQRankSum = va.info.BaseQRankSum,info_CCC = va.info.CCC,info_ClippingRankSum = va.info.ClippingRankSum,info_DB = va.info.DB,info_DP = va.info.DP,info_DS = va.info.DS,info_END = va.info.END,info_FS = va.info.FS,info_GQ_MEAN = va.info.GQ_MEAN,info_GQ_STDDEV = va.info.GQ_STDDEV,info_HWP = va.info.HWP,info_HaplotypeScore = va.info.HaplotypeScore,info_InbreedingCoeff = va.info.InbreedingCoeff,info_MLEAC = va.info.MLEAC[va.aIndex-1],info_MLEAF = va.info.MLEAF[va.aIndex-1],info_MQ = va.info.MQ,info_MQ0 = va.info.MQ0,info_MQRankSum = va.info.MQRankSum,info_NCC = va.info.NCC,info_NEGATIVE_TRAIN_SITE = va.info.NEGATIVE_TRAIN_SITE,info_POSITIVE_TRAIN_SITE = va.info.POSITIVE_TRAIN_SITE,info_QD = va.info.QD,info_ReadPosRankSum = va.info.ReadPosRankSum,info_VQSLOD = va.info.VQSLOD,info_culprit = va.info.culprit,qual = va.qual,rsid = va.rsid
^

Java stack trace:
is.hail.utils.HailException: Struct has no field `aIndex'
Available fields:
rsid: String
qual: Double
filters: Set[String]
info: Struct{NEGATIVE_TRAIN_SITE:Boolean,HWP:Double,AC:Array[Int],culprit:String,MQ0:Int,ReadPosRankSum:Double,AN:Int,InbreedingCoeff:Double,AF:Array[Double],GQ_STDDEV:Double,FS:Double,DP:Int,GQ_MEAN:Double,POSITIVE_TRAIN_SITE:Boolean,VQSLOD:Double,ClippingRankSum:Double,BaseQRankSum:Double,MLEAF:Array[Double],MLEAC:Array[Int],MQ:Double,QD:Double,END:Int,DB:Boolean,HaplotypeScore:Double,MQRankSum:Double,CCC:Int,NCC:Int,DS:Boolean}
:1:filters = va.filters,info_AC = va.info.AC[va.aIndex-1],info_AF = va.info.AF[va.aIndex-1],info_AN = va.info.AN,info_BaseQRankSum = va.info.BaseQRankSum,info_CCC = va.info.CCC,info_ClippingRankSum = va.info.ClippingRankSum,info_DB = va.info.DB,info_DP = va.info.DP,info_DS = va.info.DS,info_END = va.info.END,info_FS = va.info.FS,info_GQ_MEAN = va.info.GQ_MEAN,info_GQ_STDDEV = va.info.GQ_STDDEV,info_HWP = va.info.HWP,info_HaplotypeScore = va.info.HaplotypeScore,info_InbreedingCoeff = va.info.InbreedingCoeff,info_MLEAC = va.info.MLEAC[va.aIndex-1],info_MLEAF = va.info.MLEAF[va.aIndex-1],info_MQ = va.info.MQ,info_MQ0 = va.info.MQ0,info_MQRankSum = va.info.MQRankSum,info_NCC = va.info.NCC,info_NEGATIVE_TRAIN_SITE = va.info.NEGATIVE_TRAIN_SITE,info_POSITIVE_TRAIN_SITE = va.info.POSITIVE_TRAIN_SITE,info_QD = va.info.QD,info_ReadPosRankSum = va.info.ReadPosRankSum,info_VQSLOD = va.info.VQSLOD,info_culprit = va.info.culprit,qual = va.qual,rsid = va.rsid
^
at is.hail.utils.ErrorHandling$class.fatal(ErrorHandling.scala:6)
at is.hail.utils.package$.fatal(package.scala:27)
at is.hail.expr.ParserUtils$.error(Parser.scala:24)
at is.hail.expr.AST.parseError(AST.scala:238)
at is.hail.expr.Select.typecheckThis(AST.scala:303)
at is.hail.expr.AST.typecheckThis(AST.scala:229)
at is.hail.expr.AST.typecheck(AST.scala:235)
at is.hail.expr.AST$$anonfun$typecheck$1.apply(AST.scala:234)
at is.hail.expr.AST$$anonfun$typecheck$1.apply(AST.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at is.hail.expr.AST.typecheck(AST.scala:234)
at is.hail.expr.Apply.typecheck(AST.scala:520)
at is.hail.expr.AST$$anonfun$typecheck$1.apply(AST.scala:234)
at is.hail.expr.AST$$anonfun$typecheck$1.apply(AST.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at is.hail.expr.AST.typecheck(AST.scala:234)
at is.hail.expr.ApplyMethod.typecheck(AST.scala:613)
at is.hail.expr.Parser$$anonfun$11.apply(Parser.scala:168)
at is.hail.expr.Parser$$anonfun$11.apply(Parser.scala:167)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at is.hail.expr.Parser$.parseNamedExprs(Parser.scala:167)
at is.hail.expr.Parser$.parseNamedExprs(Parser.scala:150)
at is.hail.variant.VariantSampleMatrix.makeKT(VariantSampleMatrix.scala:1547)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)

Hail version: 0.1-74bf1eb
Error summary: HailException: Struct has no field `aIndex'
Available fields:
rsid: String
qual: Double
filters: Set[String]
info: Struct{NEGATIVE_TRAIN_SITE:Boolean,HWP:Double,AC:Array[Int],culprit:String,MQ0:Int,ReadPosRankSum:Double,AN:Int,InbreedingCoeff:Double,AF:Array[Double],GQ_STDDEV:Double,FS:Double,DP:Int,GQ_MEAN:Double,POSITIVE_TRAIN_SITE:Boolean,VQSLOD:Double,ClippingRankSum:Double,BaseQRankSum:Double,MLEAF:Array[Double],MLEAC:Array[Int],MQ:Double,QD:Double,END:Int,DB:Boolean,HaplotypeScore:Double,MQRankSum:Double,CCC:Int,NCC:Int,DS:Boolean}
:1:filters = va.filters,info_AC = va.info.AC[va.aIndex-1],info_AF = va.info.AF[va.aIndex-1],info_AN = va.info.AN,info_BaseQRankSum = va.info.BaseQRankSum,info_CCC = va.info.CCC,info_ClippingRankSum = va.info.ClippingRankSum,info_DB = va.info.DB,info_DP = va.info.DP,info_DS = va.info.DS,info_END = va.info.END,info_FS = va.info.FS,info_GQ_MEAN = va.info.GQ_MEAN,info_GQ_STDDEV = va.info.GQ_STDDEV,info_HWP = va.info.HWP,info_HaplotypeScore = va.info.HaplotypeScore,info_InbreedingCoeff = va.info.InbreedingCoeff,info_MLEAC = va.info.MLEAC[va.aIndex-1],info_MLEAF = va.info.MLEAF[va.aIndex-1],info_MQ = va.info.MQ,info_MQ0 = va.info.MQ0,info_MQRankSum = va.info.MQRankSum,info_NCC = va.info.NCC,info_NEGATIVE_TRAIN_SITE = va.info.NEGATIVE_TRAIN_SITE,info_POSITIVE_TRAIN_SITE = va.info.POSITIVE_TRAIN_SITE,info_QD = va.info.QD,info_ReadPosRankSum = va.info.ReadPosRankSum,info_VQSLOD = va.info.VQSLOD,info_culprit = va.info.culprit,qual = va.qual,rsid = va.rsid

A few more questions:
How can I install my annotations database locally?
The reason I want to install Hail locally is because of personal information security issues. Any advice on this?
How expensive is it to use Hail with cloud resources on a regular basis? I am not sure about cloud costs.
Can it be installed on AWS?

How can I install my annotations database locally?

This isn't possible; that resource is provided for GCP only.

personal information security issues. Any advice on this?

I realize that some IRBs won’t approve cloud storage, but in general I’d trust Google or Amazon’s security team over any single research institute’s.

How expensive is it to use Hail with cloud resources on a regular basis? I am not sure about cloud costs.

That totally depends on what you’re doing!

Can it be installed on AWS?

Yes! You’ll want to look into EMR, which is Amazon’s Spark product.

Regarding your aIndex error, is the VCF already biallelic? If so, you don’t need to split.
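For context on why that field can be missing: in Hail 0.1, split_multi() is what adds va.aIndex (and va.wasSplit) to the variant annotations, and the export in the error is clearly building expressions like va.info.AC[va.aIndex-1] as if the dataset had been split. Two rough ways to reconcile this are sketched below; is_split_vds is the flag already used in the pipeline script above, but whether it controls the aIndex-based expressions is an assumption on my part:

# Option 1: split multiallelics first, which adds va.aIndex / va.wasSplit,
# so expressions like va.info.AC[va.aIndex-1] can be resolved.
vds = vds.split_multi()

# Option 2: if the VCF is already biallelic and was never split, tell the
# exporter so it does not generate aIndex-based expressions (assumption:
# that is what is_split_vds controls).
client.export_vds_to_elasticsearch(
    vds,
    index_name=index_name,
    index_type_name="variant",
    is_split_vds=False,
    verbose=True,
)

Either way, the "Available fields" list in the error message is a quick way to check whether aIndex is actually present before exporting.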