Hail 0.2 - ability to load a specific past version?

Hello,

Is it possible for me to load a specific version of Hail 0.2 when starting up a GCP cluster?

It seems like Hail 0.2 gets (almost) daily commits (which is great to see - you're iterating really fast), but any one of those commits has a decent chance of breaking something in my pipeline. Ideally I'd be able to stick with a build that I've already tested for my use case for a reasonable amount of time.

Are you using cloudtools? If so, you can use the --hash argument to select a specific git hash. Note that it has to be a 12-character hash, like b76333115a3f.
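For example, the start command would look something like this (the cluster name is made up and the --version flag is an assumption from memory, so check cluster start --help in your cloudtools install for the exact flags):

    # Hypothetical cloudtools invocation - cluster name and --version are assumptions;
    # --hash is the argument described above and takes the 12-character commit hash.
    cluster start my-cluster \
        --version devel \
        --hash b76333115a3f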

Great, thanks! Maybe a really naive question: how can I check the 12-character hash of a particular version? (A version currently running in a GCP cluster, for instance.)

hail.__version__ should do it. Might not be 12 characters, though…

But if you post whatever you get, I can look it up in the git history.
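Concretely, something like this run where your pipeline runs (assuming the hail zip is on the PYTHONPATH there) should print it:

    # Print the version string of the Hail build that's actually installed.
    python -c 'import hail; print(hail.__version__)'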

devel-3959178 is what I’m getting, which is also visible in the startup message.

Here’s the full hash: 395917870b78a5ef8d173f52ad0b1e8a57239e01
and the first 12 characters: 395917870b78
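(If you ever want to do this lookup yourself, it's just standard git against a clone of the hail repository:)

    # Expand the short hash from the version string to the full commit hash,
    # then trim it to the 12 characters cloudtools expects.
    git rev-parse 3959178             # 395917870b78a5ef8d173f52ad0b1e8a57239e01
    git rev-parse --short=12 3959178  # 395917870b78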


I tried loading with the --hash argument and the 12-character hash you gave me, but doing this loads up this version:

devel-477edb9

Unfortunately, this is a version that produces an error. I also tried the full hash and the 7-character hash, and neither of those worked.

We have a deployment issue where builds are sometimes one commit behind the labelled hash. The specific commit in question doesn’t change anything significant, so that version and the one it’s labelled as should behave the same.

I think we have a fix for the underlying issue, though!

I get the following error in devel-477edb9, which I don’t get in devel-3959178:

File "/tmp/623d02edd3ba46e2bdbc20bf61275614/GTEx_v8_eQTL_pipeline_combined.py", line 162, in <module>
analysis_set = tissue_ds.filter_rows(tissue_ds.locus.contig != 'chr' + chrom).repartition(200)
File "", line 2, in repartition
File "/home/hail/hail.zip/hail/typecheck/check.py", line 486, in _typecheck
File "/home/hail/hail.zip/hail/matrixtable.py", line 2507, in repartition
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/home/hail/hail.zip/hail/utils/java.py", line 196, in deco
hail.utils.java.FatalError: FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1526058154763_0001/blockmgr-56a30c5b-0fdd-4e91-84f2-27b7965b8b60/0e/temp_shuffle_c9d9ef15-c078-4200-9ef6-80e70ec59d34 (No space left on device)

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 131.0 failed 20 times, most recent failure: Lost task 6.19 in stage 131.0 (TID 268, hail-3-w-0.c.gtex-v8.internal, executor 10): java.io.FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1526058154763_0001/blockmgr-56a30c5b-0fdd-4e91-84f2-27b7965b8b60/0e/temp_shuffle_c9d9ef15-c078-4200-9ef6-80e70ec59d34 (No space left on device)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
at is.hail.sparkextras.ContextRDD.collect(ContextRDD.scala:132)
at is.hail.rvd.OrderedRVD$.getPartitionKeyInfo(OrderedRVD.scala:479)
at is.hail.rvd.OrderedRVD.coalesce(OrderedRVD.scala:186)
at is.hail.variant.MatrixTable.coalesce(MatrixTable.scala:2073)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)java.io.FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1526058154763_0001/blockmgr-56a30c5b-0fdd-4e91-84f2-27b7965b8b60/0e/temp_shuffle_c9d9ef15-c078-4200-9ef6-80e70ec59d34 (No space left on device)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Hail version: devel-477edb9
Error summary: FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1526058154763_0001/blockmgr-56a30c5b-0fdd-4e91-84f2-27b7965b8b60/0e/temp_shuffle_c9d9ef15-c078-4200-9ef6-80e70ec59d34 (No space left on device)

This is a separate issue – "No space left on device" means the workers don’t have enough local disk to store the data your pipeline is trying to shuffle.
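If you're starting the cluster with cloudtools, one option is to give the workers bigger boot disks. A rough sketch (the disk-size and --version flag names below are assumptions from memory, so confirm them with cluster start --help):

    # Hypothetical sketch: larger worker disks give shuffle spill files more room.
    # Flag names and the 100 GB value are assumptions - verify against your
    # cloudtools version's cluster start --help.
    cluster start my-cluster \
        --version devel \
        --hash 395917870b78 \
        --worker-boot-disk-size 100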

Both runs used the same cluster configuration, which I find bizarre, since I never had to increase the default disk allocation until now. But you’re probably right - I’ll try increasing the disk size at cluster startup and get back to you.