Hail tutorial_data.tar download "disappears" on Google Cloud

Muuvaara · March 6, 2018, 3:46pm

Hi,

we are experiencing a curious error where the Hail tutorial_data.tar download seemingly succeeds, but we don’t see it materialize in our Google Cloud cluster, and also get a HailException suggesting the VDS cannot be found. Could you kindly suggest how to fix or work around this issue? Please see our Cloud Shell command, Python script, and resulting error messages below.

Sincerest thanks for any comments in advance!

--------------------------Running the command from the Cloud Shell----------------------------------

gcloud dataproc jobs submit pyspark pythontest.py --cluster=cluster-a73c --files=gs://hail-common/builds/0.1/jars/hail
-0.1-20613ed50c74-Spark-2.0.2.jar --py-files=gs://hail-common/builds/0.1/python/hail-0.1-20613ed50c74.zip --properties=spark.driver.extraClassPath=./hail-0.
1-20613ed50c74-Spark-2.0.2.jar,spark.executor.extraClassPath=./hail-0.1-20613ed50c74-Spark-2.0.2.jar

---------------------------------------pythontest.py------------------------------------------------

from hail import *
hc = HailContext()

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from collections import Counter
from math import log, isnan
from pprint import pprint

import os
if os.path.isdir(‘data/1kg.vds’) and os.path.isfile(‘data/1kg_annotations.txt’):
print(‘All files are present and accounted for!’)
else:
import sys
sys.stderr.write(‘Downloading data (~50M) from Google Storage…\n’)
import urllib
import tarfile
urllib.urlretrieve(‘https://storage.googleapis.com/hail-1kg/tutorial_data.tar’,
‘tutorial_data.tar’)
sys.stderr.write(‘Download finished!\n’)
sys.stderr.write(‘Extracting…\n’)
tarfile.open(‘tutorial_data.tar’).extractall()
if not (os.path.isdir(‘data/1kg.vds’) and os.path.isfile(‘data/1kg_annotations.txt’)):
raise RuntimeError(‘Something went wrong!’)
else:
sys.stderr.write(‘Done!\n’)

vds = hc.read(‘data/1kg.vds’)

-------------------------------------------Output--------------------------------------------------

Job [36b37434-387f-43c9-abbe-8efdd6519757] submitted.
Waiting for job output…
Running on Apache Spark version 2.0.2
SparkUI available at http://10.132.0.3:4040
Welcome to
__ __ <>__
/ /_/ /__ __/ /
/ __ / _ / / / /_/ /_/\_,_/_/_/ version 0.1-20613ed Downloading data (~50M) from Google Storage... Download finished! Extracting... Done! 2018-03-05 16:39:22 Hail: WARN:data/1kg.vds’ refers to no files
Traceback (most recent call last):
File “/tmp/36b37434-387f-43c9-abbe-8efdd6519757/pythontest.py”, line 33, in
vds = hc.read(‘data/1kg.vds’)
File “”, line 2, in read
File “/home/ec2-user/BuildAgent/work/4d93753832b3428a/python/hail/java.py”, line 121, in handle_py4j
hail.java.FatalError: HailException: arguments refer to no files: ‘data/1kg.vds’

Java stack trace:
is.hail.utils.HailException: arguments refer to no files: ‘data/1kg.vds’
at is.hail.utils.ErrorHandling$class.fatal(ErrorHandling.scala:6)
at is.hail.utils.package$.fatal(package.scala:27)
at is.hail.HailContext.readAll(HailContext.scala:431)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)

Hail version: 0.1-20613ed
Error summary: HailException: arguments refer to no files: ‘data/1kg.vds’
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [36b37434-387f-43c9-abbe-8efdd6519757] entered state [ERROR] while waiting for [DONE].

tpoterba · March 6, 2018, 3:54pm

Hi! This tutorial wasn’t really designed to work on the cloud, since it relies on local file unpacking and paths.

However, there’s something more important – you should totally get started with the 0.2 development version, which is pretty stable and is MUCH better than Hail 0.1. You can see the docs at https://www.hail.is/docs/devel/.

That tutorial will have the same problem. As a solution, I’ve put the two files it requires here:

gs://hail-tutorial/1kg.vcf.bgz
gs://hail-tutorial/1kg_annotations.txt

If you run this code with a path in your bucket before the tutorial:

    (hl.import_vcf('gs://hail-tutorial/1kg.vcf.bgz', min_partitions=4)
     .write('<your bucket>', overwrite=True))

then replace the ‘data/*’ paths as appropriate, it should work.

Muuvaara · March 6, 2018, 7:07pm

Hi Tim,

huge thanks for your help! I’m having trouble accessing the 0.2 docs link, though - could I ask you to resend it?

Much looking forward to trying out the 0.2 version!

tpoterba · March 6, 2018, 7:08pm

I fixed the link above!

Muuvaara · March 6, 2018, 7:27pm

Super, thank you so much! Is the file access controlled, though? (See below.)

Hail version: 0.1-20613ed
Error summary: GoogleJsonResponseException: 403 Forbidden
{
“code” : 403,
“errors” : [ {
“domain” : “global”,
“message” : "131076197275-compute@developer.gserviceaccount.com does not have storage.objects.get access to hail-tutorial/1kg.vcf.bgz.
",
“reason” : “forbidden”
} ],
“message” : "131076197275-compute@developer.gserviceaccount.com does not have storage.objects.get access to hail-tutorial/1kg.vcf.bgz."
}

tpoterba · March 6, 2018, 7:39pm

Oops! let me fix it now!

Topic		Replies	Views
I cannot download tutorial.tar.gz Hail Query & hailctl	2	375	April 12, 2021
Using Hail on the Google Cloud Platform Help [0.1]	18	14015	September 14, 2017
Reading VDS from google bucket fires error Help [0.1]	9	3917	May 13, 2017
Error in load_dataset() Hail Query & hailctl	6	210	May 23, 2024
Error 403 Forbidden when I try to load the hail experimental dataset '1000_Genomes_autosomes' Hail Query & hailctl	15	923	October 29, 2020

Hail tutorial_data.tar download "disappears" on Google Cloud

Related topics