Hail tutorial_data.tar download "disappears" on Google Cloud

Hi,

We are experiencing a curious error: the Hail tutorial_data.tar download seemingly succeeds, but the data never materializes on our Google Cloud cluster, and we then get a HailException saying the VDS cannot be found. Could you kindly suggest how to fix or work around this issue? Our Cloud Shell command, Python script, and the resulting error messages are below.

Sincerest thanks in advance for any comments!

--------------------------Running the command from the Cloud Shell----------------------------------

gcloud dataproc jobs submit pyspark pythontest.py \
  --cluster=cluster-a73c \
  --files=gs://hail-common/builds/0.1/jars/hail-0.1-20613ed50c74-Spark-2.0.2.jar \
  --py-files=gs://hail-common/builds/0.1/python/hail-0.1-20613ed50c74.zip \
  --properties=spark.driver.extraClassPath=./hail-0.1-20613ed50c74-Spark-2.0.2.jar,spark.executor.extraClassPath=./hail-0.1-20613ed50c74-Spark-2.0.2.jar

---------------------------------------pythontest.py------------------------------------------------

from hail import *
hc = HailContext()

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from collections import Counter
from math import log, isnan
from pprint import pprint

import os
if os.path.isdir('data/1kg.vds') and os.path.isfile('data/1kg_annotations.txt'):
    print('All files are present and accounted for!')
else:
    import sys
    sys.stderr.write('Downloading data (~50M) from Google Storage...\n')
    import urllib
    import tarfile
    urllib.urlretrieve('https://storage.googleapis.com/hail-1kg/tutorial_data.tar',
                       'tutorial_data.tar')
    sys.stderr.write('Download finished!\n')
    sys.stderr.write('Extracting...\n')
    tarfile.open('tutorial_data.tar').extractall()
    if not (os.path.isdir('data/1kg.vds') and os.path.isfile('data/1kg_annotations.txt')):
        raise RuntimeError('Something went wrong!')
    else:
        sys.stderr.write('Done!\n')

vds = hc.read('data/1kg.vds')

-------------------------------------------Output--------------------------------------------------

Job [36b37434-387f-43c9-abbe-8efdd6519757] submitted.
Waiting for job output…
Running on Apache Spark version 2.0.2
SparkUI available at http://10.132.0.3:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.1-20613ed
Downloading data (~50M) from Google Storage...
Download finished!
Extracting...
Done!
2018-03-05 16:39:22 Hail: WARN: 'data/1kg.vds' refers to no files
Traceback (most recent call last):
  File "/tmp/36b37434-387f-43c9-abbe-8efdd6519757/pythontest.py", line 33, in <module>
    vds = hc.read('data/1kg.vds')
  File "", line 2, in read
  File "/home/ec2-user/BuildAgent/work/4d93753832b3428a/python/hail/java.py", line 121, in handle_py4j
hail.java.FatalError: HailException: arguments refer to no files: 'data/1kg.vds'

Java stack trace:
is.hail.utils.HailException: arguments refer to no files: 'data/1kg.vds'
at is.hail.utils.ErrorHandling$class.fatal(ErrorHandling.scala:6)
at is.hail.utils.package$.fatal(package.scala:27)
at is.hail.HailContext.readAll(HailContext.scala:431)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)

Hail version: 0.1-20613ed
Error summary: HailException: arguments refer to no files: 'data/1kg.vds'
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [36b37434-387f-43c9-abbe-8efdd6519757] entered state [ERROR] while waiting for [DONE].

Hi! This tutorial wasn't really designed to work on the cloud: the download and extraction happen on the driver's local disk, but hc.read resolves a relative path like 'data/1kg.vds' against the cluster's default filesystem (HDFS on Dataproc), so Hail never sees the files you just unpacked.
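(If you did want to keep the 0.1 script as-is, one possible workaround, sketched below, is to copy the extracted files from the driver's local disk into the default filesystem after the extraction step, so the relative path resolves. This is just a sketch, assuming HDFS is your Dataproc cluster's default filesystem and the hadoop CLI is on the driver's PATH:

    # Sketch: push the locally extracted tutorial files into HDFS so that
    # hc.read('data/1kg.vds') resolves to real files. Relative destinations
    # land in the HDFS user home directory, matching how hc.read resolves
    # relative paths.
    import subprocess
    subprocess.check_call(['hadoop', 'fs', '-mkdir', '-p', 'data'])
    subprocess.check_call(['hadoop', 'fs', '-copyFromLocal', '-f',
                           'data/1kg.vds', 'data/1kg.vds'])
    subprocess.check_call(['hadoop', 'fs', '-copyFromLocal', '-f',
                           'data/1kg_annotations.txt', 'data/1kg_annotations.txt'])

After the copy, hc.read('data/1kg.vds') should find the dataset.)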

However, there’s something more important – you should totally get started with the 0.2 development version, which is pretty stable and is MUCH better than Hail 0.1. You can see the docs at https://www.hail.is/docs/devel/.

That tutorial will have the same problem. As a solution, I’ve put the two files it requires here:

gs://hail-tutorial/1kg.vcf.bgz
gs://hail-tutorial/1kg_annotations.txt

If you run this code before starting the tutorial, writing to a path in your own bucket:

    (hl.import_vcf('gs://hail-tutorial/1kg.vcf.bgz', min_partitions=4)
     .write('<your bucket>', overwrite=True))

and then replace the 'data/*' paths in the tutorial as appropriate, it should work.
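To illustrate, here is a rough sketch of what the start of the 0.2 tutorial might then look like. The 'gs://<your bucket>/1kg.mt' path is a placeholder: it must match whatever destination you passed to write() above.

    import hail as hl
    hl.init()

    # Placeholder path: must match the destination used in the write() step.
    mt = hl.read_matrix_table('gs://<your bucket>/1kg.mt')

    # The annotations file can be imported straight from the public bucket.
    table = (hl.import_table('gs://hail-tutorial/1kg_annotations.txt', impute=True)
             .key_by('Sample'))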

Hi Tim,

Huge thanks for your help! I'm having trouble accessing the 0.2 docs link, though; could I ask you to resend it?

Very much looking forward to trying out the 0.2 version!

I fixed the link above!

Super, thank you so much! Is access to the file restricted, though? (See the error below.)

Hail version: 0.1-20613ed
Error summary: GoogleJsonResponseException: 403 Forbidden
{
  "code" : 403,
  "errors" : [ {
    "domain" : "global",
    "message" : "131076197275-compute@developer.gserviceaccount.com does not have storage.objects.get access to hail-tutorial/1kg.vcf.bgz.",
    "reason" : "forbidden"
  } ],
  "message" : "131076197275-compute@developer.gserviceaccount.com does not have storage.objects.get access to hail-tutorial/1kg.vcf.bgz."
}

Oops! Let me fix it now!