Is it possible to load a specific version of Hail 0.2 when starting up a GCP cluster?
It seems like Hail 0.2 gets (almost) daily commits (which is great to see, that you're optimizing things really fast), but each commit has a fairly high chance of breaking something in my pipeline. Ideally I'd be able to stick with a build that I've tested and know works for my use case for a reasonable amount of time.
Are you using cloudtools? If so, you can use the --hash argument to select a specific git hash. Note that it has to be a 12-character hash, like b76333115a3f.
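For example, something along these lines (my-cluster is just a placeholder name, and I'm leaving out the other flags; cluster start --help lists everything):

    # Pin the Hail build deployed to the cluster to a specific 12-character commit hash.
    cluster start my-cluster --version devel --hash b76333115a3f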
Great, thanks! Maybe a really naive question, but how can I check the 12-character hash of a particular version? (The version currently running in a GCP cluster, for instance.)
We have a deployment issue where builds are sometimes one commit behind the labelled hash. The specific commit there doesn't change anything significant, so that version and the one it's labelled as should behave the same.
I think we have a fix for the underlying issue, though!
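As for finding the 12-character form: one way (a rough sketch, not something I've double-checked) is to take the short hash Hail reports for the running version, which shows up as devel-<short-hash> in error output and in the hail log, and expand it in a local clone of the Hail repository:

    # In a clone of github.com/hail-is/hail, expand a short hash to the
    # 12-character form that the cloudtools --hash flag expects.
    git fetch origin
    git rev-parse --short=12 b763331   # e.g. prints b76333115a3f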
I get the following error in devel-477edb9, which I don't get in devel-3959178:
File "/tmp/623d02edd3ba46e2bdbc20bf61275614/GTEx_v8_eQTL_pipeline_combined.py", line 162, in <module>
analysis_set = tissue_ds.filter_rows(tissue_ds.locus.contig != 'chr' + chrom).repartition(200)
File "", line 2, in repartition
File "/home/hail/hail.zip/hail/typecheck/check.py", line 486, in _typecheck
File "/home/hail/hail.zip/hail/matrixtable.py", line 2507, in repartition
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/home/hail/hail.zip/hail/utils/java.py", line 196, in deco
hail.utils.java.FatalError: FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1526058154763_0001/blockmgr-56a30c5b-0fdd-4e91-84f2-27b7965b8b60/0e/temp_shuffle_c9d9ef15-c078-4200-9ef6-80e70ec59d34 (No space left on device)
Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 131.0 failed 20 times, most recent failure: Lost task 6.19 in stage 131.0 (TID 268, hail-3-w-0.c.gtex-v8.internal, executor 10): java.io.FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1526058154763_0001/blockmgr-56a30c5b-0fdd-4e91-84f2-27b7965b8b60/0e/temp_shuffle_c9d9ef15-c078-4200-9ef6-80e70ec59d34 (No space left on device)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
at is.hail.sparkextras.ContextRDD.collect(ContextRDD.scala:132)
at is.hail.rvd.OrderedRVD$.getPartitionKeyInfo(OrderedRVD.scala:479)
at is.hail.rvd.OrderedRVD.coalesce(OrderedRVD.scala:186)
at is.hail.variant.MatrixTable.coalesce(MatrixTable.scala:2073)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
java.io.FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1526058154763_0001/blockmgr-56a30c5b-0fdd-4e91-84f2-27b7965b8b60/0e/temp_shuffle_c9d9ef15-c078-4200-9ef6-80e70ec59d34 (No space left on device)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Hail version: devel-477edb9
Error summary: FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1526058154763_0001/blockmgr-56a30c5b-0fdd-4e91-84f2-27b7965b8b60/0e/temp_shuffle_c9d9ef15-c078-4200-9ef6-80e70ec59d34 (No space left on device)
These runs used the same cluster configuration, which I find bizarre, since I've never had to increase the default disk allocation until now. But you're probably right that I should try increasing the disk size at cluster startup - let me try that and get back to you.
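For reference, here's the relaunch I'm planning to try (flag names as I remember them from cloudtools, so worth double-checking against cluster start --help):

    # Restart with larger worker boot disks so the Spark shuffle files written
    # under /hadoop/yarn/nm-local-dir have more local space to spill into.
    # The 100 (GB) is just a guess at something comfortably above the default.
    cluster start hail-3 --version devel --worker-boot-disk-size 100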