Extracting gnomAD counts into a VCF file

Hello, I want to extract the allele counts and homozygote counts from gnomAD v2.1.1 and v3.1.

I followed the command from this thread and was able to get the output as a TSV, but I want the output in VCF format.

Trying to export the same table as a VCF gives me only the locus and ref/alt alleles.

The commands I used were -

import hail as hl
hl.init(default_reference='GRCh38', tmp_dir='/mnt/exome/tmp')  # tmp_dir is needed, otherwise I run out of space
ht = hl.read_table('gnomad.genomes.r2.1.1.sites.liftover_grch38.ht')
ds = ht.select(gnomad_ac=ht.freq[0].AC, gnomad_homozygote_count=ht.freq[0].homozygote_count)
hl.export_vcf(ds, 'gnomad_v211_genome.vcf.bgz')

Part of the discussion was from Gnomad allele frequency query - #9 by kvn95ss

There I wanted the data in a specific format (chr, start, end, ref, alt, AC, Hom) so I could integrate it with ANNOVAR. If the output is in VCF format, I can let ANNOVAR create the appropriate file for my use.

In the VCF file, I get only locus and alleles -

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
chr1    10067   .       T       TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC      .       .       .
chr1    10108   .       CAACCCT C       .       .       .

I’m not really sure how to make AC and the homozygote count appear in the INFO column. Is there a specific way I should load them?

On a similar note, I tried exporting the ht directly as a VCF, but that produces a similar file.

How can I extract the data in VCF format?

Hail exports the fields of struct info as INFO fields, the elements of set<str> filters as FILTERS, the value of str rsid as ID, and the value of float64 qual as QUAL. No other row fields are exported.
https://hail.is/docs/0.2/methods/impex.html#hail.methods.export_vcf

The table being exported needs to have an info field for the VCF to contain INFO fields.

Try replacing this line

ds = ht.select(gnomad_ac=ht.freq[0].AC, gnomad_homozygote_count=ht.freq[0].homozygote_count)

with

ds = ht.select(
  info=hl.struct(
    gnomad_ac=ht.freq[0].AC,
    gnomad_homozygote_count=ht.freq[0].homozygote_count
  )
)
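
If you also want the ##INFO header lines to carry descriptions (ANNOVAR and other downstream tools are happier with a well-formed header), export_vcf takes a metadata argument; as far as I know Hail fills in Number and Type from the field types, and the descriptions come from metadata. Something like the sketch below should work, with placeholder descriptions you'd want to adjust:

ds = ht.select(
  info=hl.struct(
    gnomad_ac=ht.freq[0].AC,
    gnomad_homozygote_count=ht.freq[0].homozygote_count
  )
)
# placeholder header descriptions for the INFO fields, adjust as needed
metadata = {
  'info': {
    'gnomad_ac': {'Description': 'gnomAD allele count'},
    'gnomad_homozygote_count': {'Description': 'gnomAD homozygote count'}
  }
}
hl.export_vcf(ds, 'gnomad_v211_genome.vcf.bgz', metadata=metadata)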

Hey @nawatts, the command worked well; I was able to extract the file that I wanted!

I could not figure out how to load the data as a struct; this command does it easily. I think I should spend more time on Python :sweat:

Hello, I was able to extract the VCF for the gnomAD v2.1.1 data, but I’m running out of memory when trying to extract the same data from gnomAD v3.1.

Here is the last line of the error. I can post the full error if required -

Hail version: 0.2.61-3c86d3ba497a
Error summary: OutOfMemoryError: GC overhead limit exceeded

Now, I tried to allot 200GB of RAM with this command

hl.init(default_reference='GRCh38',tmp_dir='/mnt/exome/tmp',spark_conf='--executor-memory 200g')

But got this error -
TypeError: init: parameter 'spark_conf': expected (None or Mapping[str, str]), found str: --executor-memory 200g

I’m sure it’s just my syntax, but I’m not sure how to proceed. BTW the program is running locally, not on a cluster.

See here for how to set memory configurations using the PYSPARK_SUBMIT_ARGS environment var: Java Heap Space out of memory - #5 by yl3336
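
Roughly, for a local (non-cluster) run, that looks like the sketch below. The variable has to be set before hl.init() launches the JVM, and the string has to end with pyspark-shell; the memory value is just a placeholder for whatever your machine actually has.

import os
# set before hl.init() so the JVM is launched with this heap size;
# in local mode everything runs in the driver, so --driver-memory is the key setting
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-memory 180g pyspark-shell'

import hail as hl
hl.init(default_reference='GRCh38', tmp_dir='/mnt/exome/tmp')

As an aside, the spark_conf argument does take a dict (e.g. spark_conf={'spark.executor.memory': '200g'}), which is what the TypeError was complaining about, but in local mode the driver's heap is fixed when the JVM starts, so the environment variable above is the reliable way to raise it.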

I set the memory to 180 GB but still got this error -

Stage 0:>                                                    (0 + 56) / 115376]
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "refresh progress"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<decorator-gen-1286>", line 2, in export_vcf
  File "/root/anaconda3/envs/hail/lib/python3.7/site-packages/hail/typecheck/check.py", line 614, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/root/anaconda3/envs/hail/lib/python3.7/site-packages/hail/methods/impex.py", line 530, in export_vcf
    Env.backend().execute(ir.MatrixWrite(dataset._mir, writer))
  File "/root/anaconda3/envs/hail/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 98, in execute
    raise e
  File "/root/anaconda3/envs/hail/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 74, in execute
    result = json.loads(self._jhc.backend().executeJSON(jir))
  File "/root/anaconda3/envs/hail/lib/python3.7/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/root/anaconda3/envs/hail/lib/python3.7/site-packages/hail/backend/py4j_backend.py", line 32, in deco
    'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
hail.utils.java.FatalError: SparkException: Job aborted due to stage failure: Task 23 in stage 0.0 failed 1 times, most recent failure: Lost task 23.0 in stage 0.0 (TID 23, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 310597 ms
Driver stacktrace:
Java stack trace:
org.apache.spark.SparkException: Job aborted.
        at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:100)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1096)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1094)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1067)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$3.apply$mcV$sp(PairRDDFunctions.scala:1013)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$3.apply(PairRDDFunctions.scala:1013)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$3.apply(PairRDDFunctions.scala:1013)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1012)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$2.apply$mcV$sp(PairRDDFunctions.scala:970)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$2.apply(PairRDDFunctions.scala:968)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$2.apply(PairRDDFunctions.scala:968)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:968)
        at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$2.apply$mcV$sp(RDD.scala:1517)
        at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$2.apply(RDD.scala:1505)
        at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$2.apply(RDD.scala:1505)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1505)
        at is.hail.utils.richUtils.RichRDD$.writeTable$extension(RichRDD.scala:78)
        at is.hail.io.vcf.ExportVCF$.apply(ExportVCF.scala:462)
        at is.hail.expr.ir.MatrixVCFWriter.apply(MatrixWriter.scala:321)
        at is.hail.expr.ir.WrappedMatrixWriter.apply(MatrixWriter.scala:40)
        at is.hail.expr.ir.Interpret$.run(Interpret.scala:825)
        at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:53)
        at is.hail.expr.ir.InterpretNonCompilable$.interpretAndCoerce$1(InterpretNonCompilable.scala:16)
        at is.hail.expr.ir.InterpretNonCompilable$.is$hail$expr$ir$InterpretNonCompilable$$rewrite$1(InterpretNonCompilable.scala:53)
        at is.hail.expr.ir.InterpretNonCompilable$.apply(InterpretNonCompilable.scala:58)
        at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.transform(LoweringPass.scala:67)
        at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
        at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
        at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
        at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:15)
        at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:13)
        at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
        at is.hail.expr.ir.lowering.LoweringPass$class.apply(LoweringPass.scala:13)
        at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.apply(LoweringPass.scala:62)
        at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:14)
        at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:12)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
        at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:12)
        at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:28)
        at is.hail.backend.spark.SparkBackend.is$hail$backend$spark$SparkBackend$$_execute(SparkBackend.scala:354)
        at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:338)
        at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:335)
        at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:25)
        at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:23)
        at is.hail.utils.package$.using(package.scala:618)
        at is.hail.annotations.Region$.scoped(Region.scala:18)
        at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:23)
        at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:247)
        at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:335)
        at is.hail.backend.spark.SparkBackend$$anonfun$7.apply(SparkBackend.scala:379)
        at is.hail.backend.spark.SparkBackend$$anonfun$7.apply(SparkBackend.scala:377)
        at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
        at is.hail.backend.spark.SparkBackend.executeJSON(SparkBackend.scala:377)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)

org.apache.spark.SparkException: Job aborted due to stage failure: Task 23 in stage 0.0 failed 1 times, most recent failure: Lost task 23.0 in stage 0.0 (TID 23, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 310597 ms
Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114)
        at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1096)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1094)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1067)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$3.apply$mcV$sp(PairRDDFunctions.scala:1013)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$3.apply(PairRDDFunctions.scala:1013)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$3.apply(PairRDDFunctions.scala:1013)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1012)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$2.apply$mcV$sp(PairRDDFunctions.scala:970)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$2.apply(PairRDDFunctions.scala:968)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$2.apply(PairRDDFunctions.scala:968)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:968)
        at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$2.apply$mcV$sp(RDD.scala:1517)
        at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$2.apply(RDD.scala:1505)
        at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$2.apply(RDD.scala:1505)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1505)
        at is.hail.utils.richUtils.RichRDD$.writeTable$extension(RichRDD.scala:78)
        at is.hail.io.vcf.ExportVCF$.apply(ExportVCF.scala:462)
        at is.hail.expr.ir.MatrixVCFWriter.apply(MatrixWriter.scala:321)
        at is.hail.expr.ir.WrappedMatrixWriter.apply(MatrixWriter.scala:40)
        at is.hail.expr.ir.Interpret$.run(Interpret.scala:825)
        at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:53)
        at is.hail.expr.ir.InterpretNonCompilable$.interpretAndCoerce$1(InterpretNonCompilable.scala:16)
        at is.hail.expr.ir.InterpretNonCompilable$.is$hail$expr$ir$InterpretNonCompilable$$rewrite$1(InterpretNonCompilable.scala:53)
        at is.hail.expr.ir.InterpretNonCompilable$.apply(InterpretNonCompilable.scala:58)
        at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.transform(LoweringPass.scala:67)
        at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
        at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
        at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
        at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:15)
        at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:13)
        at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
        at is.hail.expr.ir.lowering.LoweringPass$class.apply(LoweringPass.scala:13)
        at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.apply(LoweringPass.scala:62)
        at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:14)
        at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:12)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
        at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:12)
        at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:28)
        at is.hail.backend.spark.SparkBackend.is$hail$backend$spark$SparkBackend$$_execute(SparkBackend.scala:354)
        at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:338)
        at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:335)
        at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:25)
        at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:23)
        at is.hail.utils.package$.using(package.scala:618)
        at is.hail.annotations.Region$.scoped(Region.scala:18)
        at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:23)
        at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:247)
        at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:335)
        at is.hail.backend.spark.SparkBackend$$anonfun$7.apply(SparkBackend.scala:379)
        at is.hail.backend.spark.SparkBackend$$anonfun$7.apply(SparkBackend.scala:377)
        at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
        at is.hail.backend.spark.SparkBackend.executeJSON(SparkBackend.scala:377)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)




Hail version: 0.2.61-3c86d3ba497a
Error summary: SparkException: Job aborted due to stage failure: Task 23 in stage 0.0 failed 1 times, most recent failure: Lost task 23.0 in stage 0.0 (TID 23, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 310597 ms
Driver stacktrace:

Do I need to throw in more RAM? Can I somehow export the dataset one chromosome at a time?

No, this is a different issue. What is your full pipeline?

Hey, it turns out I hadn’t copied the PYSPARK_SUBMIT_ARGS line correctly. It worked once I copied the entire line and set an appropriate amount of memory.

Thanks!