Hi,
We are seeing a curious error: the Hail tutorial_data.tar download and extraction appear to succeed, but the data never shows up on our Google Cloud Dataproc cluster, and hc.read then fails with a HailException saying the VDS cannot be found. Could you kindly suggest how to fix or work around this? Our Cloud Shell command, Python script, and the resulting error messages are below.
Sincerest thanks for any comments in advance!
--------------------------Running the command from the Cloud Shell----------------------------------
gcloud dataproc jobs submit pyspark pythontest.py \
    --cluster=cluster-a73c \
    --files=gs://hail-common/builds/0.1/jars/hail-0.1-20613ed50c74-Spark-2.0.2.jar \
    --py-files=gs://hail-common/builds/0.1/python/hail-0.1-20613ed50c74.zip \
    --properties=spark.driver.extraClassPath=./hail-0.1-20613ed50c74-Spark-2.0.2.jar,spark.executor.extraClassPath=./hail-0.1-20613ed50c74-Spark-2.0.2.jar
---------------------------------------pythontest.py------------------------------------------------
from hail import *
hc = HailContext()
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from collections import Counter
from math import log, isnan
from pprint import pprint
import os
if os.path.isdir('data/1kg.vds') and os.path.isfile('data/1kg_annotations.txt'):
    print('All files are present and accounted for!')
else:
    import sys
    sys.stderr.write('Downloading data (~50M) from Google Storage...\n')
    import urllib
    import tarfile
    urllib.urlretrieve('https://storage.googleapis.com/hail-1kg/tutorial_data.tar',
                       'tutorial_data.tar')
    sys.stderr.write('Download finished!\n')
    sys.stderr.write('Extracting...\n')
    tarfile.open('tutorial_data.tar').extractall()
    if not (os.path.isdir('data/1kg.vds') and os.path.isfile('data/1kg_annotations.txt')):
        raise RuntimeError('Something went wrong!')
    else:
        sys.stderr.write('Done!\n')
vds = hc.read('data/1kg.vds')
-------------------------------------------Output--------------------------------------------------
Job [36b37434-387f-43c9-abbe-8efdd6519757] submitted.
Waiting for job output...
Running on Apache Spark version 2.0.2
SparkUI available at http://10.132.0.3:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.1-20613ed
Downloading data (~50M) from Google Storage...
Download finished!
Extracting...
Done!
2018-03-05 16:39:22 Hail: WARN: 'data/1kg.vds' refers to no files
Traceback (most recent call last):
  File "/tmp/36b37434-387f-43c9-abbe-8efdd6519757/pythontest.py", line 33, in <module>
    vds = hc.read('data/1kg.vds')
  File "", line 2, in read
  File "/home/ec2-user/BuildAgent/work/4d93753832b3428a/python/hail/java.py", line 121, in handle_py4j
hail.java.FatalError: HailException: arguments refer to no files: 'data/1kg.vds'
Java stack trace:
is.hail.utils.HailException: arguments refer to no files: 'data/1kg.vds'
at is.hail.utils.ErrorHandling$class.fatal(ErrorHandling.scala:6)
at is.hail.utils.package$.fatal(package.scala:27)
at is.hail.HailContext.readAll(HailContext.scala:431)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Hail version: 0.1-20613ed
Error summary: HailException: arguments refer to no files: 'data/1kg.vds'
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [36b37434-387f-43c9-abbe-8efdd6519757] entered state [ERROR] while waiting for [DONE].
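
One check we could add to the script is sketched below. It rests on an assumption we have not confirmed: that the tar is extracted onto the local disk of the machine running the driver, while hc.read resolves a path without a scheme against the cluster's default filesystem rather than that local disk. The helper name describe_local_path is our own and not part of Hail. It only reports where the relative path points locally and what the equivalent file:// URI would be.

```python
import os


def describe_local_path(rel_path):
    """Report where a relative path resolves on the local filesystem.

    Hypothetical diagnostic helper (not part of Hail): it does not read
    the VDS, it only shows the absolute local path, whether that path
    exists locally, and the corresponding file:// URI.
    """
    abs_path = os.path.abspath(rel_path)
    return {
        'abs_path': abs_path,
        'exists_locally': os.path.exists(abs_path),
        'file_uri': 'file://' + abs_path,
    }


if __name__ == '__main__':
    # Same relative path that the script passes to hc.read.
    print(describe_local_path('data/1kg.vds'))
```

If exists_locally comes back True while hc.read still reports no files, that would suggest the data sits only on the local disk of one node and would need to be copied somewhere every node can read before hc.read can find it.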