Hail 0.2 - ability to load a specific past version?

Hello,

Is it possible for me to load a specific version of Hail 0.2 when starting up a GCP cluster?

It seems like Hail 0.2 gets (almost) daily commits (which is great to see - you're iterating really fast), but any one of those commits has a decent chance of breaking something in my pipeline. Ideally I'd be able to stick with a build that I've already tested for my use case for a reasonable amount of time.

Are you using cloudtools? If so, you can use the --hash argument to select a specific git hash. Note that it has to be a 12-character hash, like b76333115a3f.
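For example, the start command would look something like this (the cluster name is made up and the --version flag is an assumption from memory, so check cluster start --help in your cloudtools install for the exact flags):

    # Hypothetical cloudtools invocation - cluster name and --version are assumptions;
    # --hash is the argument described above and takes the 12-character commit hash.
    cluster start my-cluster \
        --version devel \
        --hash b76333115a3f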

Great, thanks! Maybe a really naive question: how can I check the 12-character hash of a particular version? (A version currently running in a GCP cluster, for instance.)

hail.__version__ should do it. Might not be 12 characters, though…

But if you post whatever you get, I can look it up in the git history.
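Concretely, something like this run where your pipeline runs (assuming the hail zip is on the PYTHONPATH there) should print it:

    # Print the version string of the Hail build that's actually installed.
    python -c 'import hail; print(hail.__version__)'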

devel-3959178 is what I’m getting, which is also visible in the startup message.

Here’s the full hash: 395917870b78a5ef8d173f52ad0b1e8a57239e01
and the first 12 characters: 395917870b78
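(If you ever want to do this lookup yourself, it's just standard git against a clone of the hail repository:)

    # Expand the short hash from the version string to the full commit hash,
    # then trim it to the 12 characters cloudtools expects.
    git rev-parse 3959178             # 395917870b78a5ef8d173f52ad0b1e8a57239e01
    git rev-parse --short=12 3959178  # 395917870b78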


I tried loading with the --hash argument and the 12-character hash you gave me, but doing this loads up this version:

devel-477edb9

Unfortunately, this is a version that produces an error. I also tried the full hash and the 7-character hash, and neither of those worked.

We have a deployment issue where builds are sometimes one commit behind the labelled hash. The specific commit in question doesn’t change anything significant, so that version and the one it’s labelled as should behave the same.

I think we have a fix for the underlying issue, though!

I get the following error in devel-477edb9, which I don’t get in devel-3959178:

File "/tmp/623d02edd3ba46e2bdbc20bf61275614/GTEx_v8_eQTL_pipeline_combined.py", line 162, in <module>
analysis_set = tissue_ds.filter_rows(tissue_ds.locus.contig != 'chr' + chrom).repartition(200)
File "", line 2, in repartition
File "/home/hail/hail.zip/hail/typecheck/check.py", line 486, in _typecheck
File "/home/hail/hail.zip/hail/matrixtable.py", line 2507, in repartition
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/home/hail/hail.zip/hail/utils/java.py", line 196, in deco
hail.utils.java.FatalError: FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1526058154763_0001/blockmgr-56a30c5b-0fdd-4e91-84f2-27b7965b8b60/0e/temp_shuffle_c9d9ef15-c078-4200-9ef6-80e70ec59d34 (No space left on device)

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 131.0 failed 20 times, most recent failure: Lost task 6.19 in stage 131.0 (TID 268, hail-3-w-0.c.gtex-v8.internal, executor 10): java.io.FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1526058154763_0001/blockmgr-56a30c5b-0fdd-4e91-84f2-27b7965b8b60/0e/temp_shuffle_c9d9ef15-c078-4200-9ef6-80e70ec59d34 (No space left on device)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
at is.hail.sparkextras.ContextRDD.collect(ContextRDD.scala:132)
at is.hail.rvd.OrderedRVD$.getPartitionKeyInfo(OrderedRVD.scala:479)
at is.hail.rvd.OrderedRVD.coalesce(OrderedRVD.scala:186)
at is.hail.variant.MatrixTable.coalesce(MatrixTable.scala:2073)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)java.io.FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1526058154763_0001/blockmgr-56a30c5b-0fdd-4e91-84f2-27b7965b8b60/0e/temp_shuffle_c9d9ef15-c078-4200-9ef6-80e70ec59d34 (No space left on device)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Hail version: devel-477edb9
Error summary: FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1526058154763_0001/blockmgr-56a30c5b-0fdd-4e91-84f2-27b7965b8b60/0e/temp_shuffle_c9d9ef15-c078-4200-9ef6-80e70ec59d34 (No space left on device)

This is a separate issue – "No space left on device" means the workers don’t have enough local disk to store the data your pipeline is trying to shuffle.
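If you're starting the cluster with cloudtools, one option is to give the workers bigger boot disks. A rough sketch (the disk-size and --version flag names below are assumptions from memory, so confirm them with cluster start --help):

    # Hypothetical sketch: larger worker disks give shuffle spill files more room.
    # Flag names and the 100 GB value are assumptions - verify against your
    # cloudtools version's cluster start --help.
    cluster start my-cluster \
        --version devel \
        --hash 395917870b78 \
        --worker-boot-disk-size 100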

Both runs used the same cluster configuration, which I find bizarre, since I never had to increase the default disk allocation until now. But you’re probably right - I’ll try increasing the disk size at cluster startup and get back to you.