New to Hail and gnomAD, setup help is needed badly :-)


#1

Hello,

I have downloaded the gnomAD files to my cloud bucket and got the following tree:
https://console.cloud.google.com/storage/browser/gnomad-public/release-170228

I would like to work with the data in VDS format, as described on hail.is: https://hail.is/hail/overview.html#variant-dataset-vds

Hail setup: I am using a cluster of simple machines. Highmem machines are not available to me on the trial version.

Could you refer me to a video or page with guidelines on how to continue from this point? Most of the pages that I have seen refer to VDS files, but I have only seen VDS folders (Google bucket style) containing JSON and many small partitioned files.
Any help is very much appreciated!!!
Thanks,
eilalan


#2

The VDS format is a folder! We often refer to it as a file for simplicity. Take a look at this post for how to get started on Google Cloud: Using Hail on the Google Cloud Platform


#3

Thank you for the quick response.
I followed the guidelines, ran the sample Python script, and received an error about the decorator module.

Is there any additional prerequisite that I have missed?
Many thanks!!!

gcloud dataproc jobs submit pyspark --cluster=hail-start --files=gs://hail-common/hail-hail-is-master-all-spark2.0.2-e4880e9.jar --py-files=gs://hail-common/pyhail-hail-is-master-e4880e9.zip --properties="spark.driver.extraClassPath=./hail-hail-is-master-all-spark2.0.2-e4880e9.jar,spark.executor.extraClassPath=./hail-hail-is-master-all-spark2.0.2-e4880e9.jar" myhailscript.py
Copying file://myhailscript.py [Content-Type=text/x-python]…
/ [1 files][ 119.0 B/ 119.0 B]
Operation completed over 1 objects/119.0 B.
Job [212367e7-10f3-47ef-b37f-831d3d638f16] submitted.
Waiting for job output…
Traceback (most recent call last):
  File "/tmp/212367e7-10f3-47ef-b37f-831d3d638f16/myhailscript.py", line 1, in <module>
    from hail import *
  File "/home/ec2-user/BuildAgent/work/c38e75e72b769a7c/python/hail/__init__.py", line 1, in <module>
  File "/home/ec2-user/BuildAgent/work/c38e75e72b769a7c/python/hail/representation/__init__.py", line 1, in <module>
  File "/home/ec2-user/BuildAgent/work/c38e75e72b769a7c/python/hail/representation/variant.py", line 1, in <module>
  File "/home/ec2-user/BuildAgent/work/c38e75e72b769a7c/python/hail/java.py", line 2, in <module>
ImportError: No module named decorator
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [212367e7-10f3-47ef-b37f-831d3d638f16] entered state [ERROR] while waiting for [DONE].


#4

See here: Python error: ImportError: No module named decorator
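
A quick way to confirm which Python modules are missing on a machine before submitting a job is a small preflight check. This is a minimal sketch and not part of Hail: the helper name `missing_modules` is hypothetical, and `decorator` is used only because it is the module named in the traceback above.

```python
import importlib


def missing_modules(names):
    """Return the subset of module names that cannot be imported here."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# "decorator" is the module the traceback complains about.
report = missing_modules(["decorator"])
```

An empty `report` would mean the module is importable on the machine where the check ran; it says nothing about other nodes in the cluster.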


#5

Thank you. The --initialization-actions step is done.

I am getting the permission error below when executing the Python script. I am the owner of the project, so I am not sure why I am getting this error.


luster-1-m:~$ gcloud dataproc jobs submit pyspark --cluster=cluster-1 --files=gs://hail-common/hail-hail-is-master-all-spark2.0.2-e4880e9.jar --py-files=gs://hail-common/pyhail-hail-is-master-e4880e9.zip --properties="spark.driver.extraClassPath=./hail-hail-is-master-all-spark2.0.2-e4880e9.jar,spark.executor.extraClassPath=./hail-hail-is-master-all-spark2.0.2-e4880e9.jar" myhailscript.py
ERROR: (gcloud.dataproc.jobs.submit.pyspark) PERMISSION_DENIED: Request had insufficient authentication scopes.


Once I have the environment working, I would like to look at one of the VDS folders in BigQuery. Does anyone have experience exporting gnomAD data to BigQuery?
Are there any known issues with querying the samples / variants using BigQuery?

Thanks,
eilalan


#6

I think this will solve your authentication issue: http://stackoverflow.com/questions/35928534/403-request-had-insufficient-authentication-scopes-during-gcloud-container-clu

If not, searching the error message turns up a bunch of other possibilities too.

Regarding BigQuery: I don’t see any reason why it wouldn’t work. You’ll probably want to use something like:

vds.variants_keytable().annotate('js = json({v: v, va: va})').select('js').to_dataframe().write.csv('...')

to transform the data into a BigQuery-ingestible JSON file.
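
BigQuery can load newline-delimited JSON, that is, one JSON object per line. The sketch below illustrates that target layout in plain Python. It does not use Hail, the helper name `to_ndjson` is hypothetical, and the two records are made-up placeholders shaped like the `{v: ..., va: ...}` struct in the expression above.

```python
import json


def to_ndjson(records):
    """Serialize each record as one compact JSON object per line."""
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in records)


# Hypothetical records; real rows would come from the key table.
records = [
    {"v": {"contig": "Y", "start": 2710308, "ref": "GTTTT"}, "va": {}},
    {"v": {"contig": "Y", "start": 2829226, "ref": "A"}, "va": {}},
]

ndjson = to_ndjson(records)

# Each line parses back on its own, which is what a line-oriented loader needs.
parsed = [json.loads(line) for line in ndjson.splitlines()]
```

A file in this layout would be loaded with `--source_format=NEWLINE_DELIMITED_JSON` rather than CSV.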


#7

Thank you. I was able to figure this out, and I am now running into a file-not-found error from hail-common (see below).
I would like to use pyhail-submit or pyhail or anything else. How do I use it? Do I copy it to my bucket?
To the master? Using gsutil? I am new to Python as well, so basic info might be needed.


gsutil cat gs://hail-common/latest-hash.txt
E4880e9


gcloud dataproc jobs submit pyspark --cluster=cluster-2 --files=gs://hail-common/hail-hail-is-master-all-spark2.0.2-E4880e9.jar --py-files=gs://hail-common/pyhail-hail-is-master-E4880e9.zip --properties="spark.driver.extraClassPath=./hail-hail-is-master-all-spark2.0.2-E4880e9.jar,spark.executor.extraClassPath=./hail-hail-is-master-all-spark2.0.2-E4880e9.jar" myhailscript.py


=========== Cloud Dataproc Agent Error ===========
java.io.FileNotFoundException: File not found : gs://hail-common/pyhail-hail-is-master-E4880e9.zip
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.getFileStatus(GoogleHadoopFileSystemBase.java:1427)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2034)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:2003)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.copyToLocalFile(GoogleHadoopFileSystemBase.java:2006)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1979)
at com.google.cloud.hadoop.services.agent.util.HadoopUtil.download(HadoopUtil.java:71)
at com.google.cloud.hadoop.services.agent.job.AbstractJobHandler.downloadResources(AbstractJobHandler.java:424)
at com.google.cloud.hadoop.services.agent.job.AbstractJobHandler$StartDriver.call(AbstractJobHandler.java:543)
at com.google.cloud.hadoop.services.agent.job.AbstractJobHandler$StartDriver.call(AbstractJobHandler.java:532)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
======== End of Cloud Dataproc Agent Error ========
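
One thing that may be worth checking, since Cloud Storage object names are case-sensitive: the hash in the failing path starts with an uppercase `E`, while the earlier commands in this thread used `e4880e9`. The sketch below builds both artifact paths from a hash string without changing its case, so the two spellings can be compared. The helper name `hail_artifact_paths` is hypothetical, the path templates are copied from the commands above, and whether the case difference is the actual cause is an assumption.

```python
def hail_artifact_paths(build_hash, spark_version="2.0.2"):
    """Build the jar and zip paths used in the submit command from a hash."""
    h = build_hash.strip()  # drop the trailing newline from `gsutil cat`, keep case
    jar = "gs://hail-common/hail-hail-is-master-all-spark{}-{}.jar".format(spark_version, h)
    zip_path = "gs://hail-common/pyhail-hail-is-master-{}.zip".format(h)
    return jar, zip_path


lower = hail_artifact_paths("e4880e9\n")
upper = hail_artifact_paths("E4880e9\n")

# The two results name different objects, even though they differ only in case.
same_object = lower == upper
```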


#8

I am all set, running with IPython on the Spark machine.
Thanks!!!


#9

Is there a way to keep the machines set up while keeping the cost low? Should I save the image, or is there another option?


#10

Not that I know of. If you’ve got your data in a Google bucket, then the best way to save your ‘state’ is to save the Hail code you ran! Run it again the next time you’re doing analysis!

Alternatively, you can write a VDS to a new file (really a folder, as you’ve discovered ;)) if you want a permanent record of what you’ve done to it (filtered, annotated, etc).


#11

Thanks. I guess that I will have to finish the data export soon :-)

I was running the command to convert the VDS to CSV (from this thread). An error is raised when I define the columns that I want to export to CSV. I think that I need some kind of definition or import for the KeyTable (v, va)… but I might be wrong. Could you please take a look?
Thanks,
eilalan


The script (I broke the command into multiple small commands to find out where the issue is; the line that raises the error is the ktv_select line, as the traceback below shows):
from hail import *
import json

print("hc")
hc = HailContext()
# read the VDS
# convert the VDS to VCF
print("vds")
vds = hc.read('gs://data_gnomad_orielresearch/gnomad.exomes.r2.0.1.sites.Y.vds')
print("finished read - Read .vds files as variant dataset.")
#########################
## VARIANTS ANNOTATION
#########################
print("running variants_keytable")
ktv = vds.variants_keytable()
print("ktv is Key table with variants and variants annotations. ktv from type KeyTable")
print("Add new columns computed from existing columns. Select a subset of columns.")
ktv_select = ktv.annotate('js = json({v: v, va: va})').select('js')

print("ktv_select from type: keyTable")

print("Converts this key table to a Spark DataFrame.")
ktv_df = ktv_select.to_dataframe()
print("ktv_df from type: pyspark.sql.DataFrame")

print("write ktv_df to csv")
ktv_df.write.csv("gs://data_gnomad_orielresearch/vdsToCsv_variants.csv")
print("gs://data_gnomad_orielresearch/vdsToCsv_variants.csv was written - check it")


  File "/tmp/47957ee3-bdc5-4add-955b-2a14062b1534/convertVDSToCSV.py", line 32, in <module>
    ktv_select = ktv.annotate('js = json({v: v, va: va})').select('js')
  File "", line 2, in select
  File "/home/ec2-user/BuildAgent/work/c38e75e72b769a7c/python/hail/java.py", line 119, in handle_py4j
hail.java.FatalError: An error occurred while calling into JVM, probably due to invalid parameter types.

Java stack trace:
An error occurred while calling o66.select. Trace:
py4j.Py4JException: Method select([class java.lang.String, class java.util.ArrayList]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)

Hail version: devel-fff80b1
Error summary: An error occurred while calling into JVM, probably due to invalid parameter types.
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [47957ee3-bdc5-4add-955b-2a14062b1534] entered state [ERROR] while waiting for [DONE].
wm8af-056:scripts landkof$


#12

Ugh, I hate these error messages. We’ll have this fixed next week.

I think the issue here is that select only takes a list of str, so it should be:

ktv_select = ktv.annotate('js = json({v: v, va: va})').select(['js'])
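
The Py4J message `Method select([class java.lang.String, class java.util.ArrayList]) does not exist` fits that reading: a bare string was passed where a list was expected. A defensive pattern is to normalize the argument before the call. The sketch below is plain Python with a hypothetical helper, `as_column_list`; it is not part of Hail's API.

```python
def as_column_list(columns):
    """Accept one column name or an iterable of names; always return a list."""
    if isinstance(columns, str):
        return [columns]
    return list(columns)


single = as_column_list("js")
several = as_column_list(("v", "va"))
```

With such a helper, a call like `ktv.select(as_column_list('js'))` would pass a one-element list of column names.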

#13

Thank you, this was very helpful. I was able to make some progress with the VDS => CSV => BQ data export.

This is the command on the Python side:
ktv_df.write.mode("overwrite").csv("gs://data_gnomad_orielresearch/vdsToCsv_variants.csv", header=True)

BigQuery load command is:
bq load -F , --source_format CSV --skip_leading_rows=0 --max_bad_records=1 --autodetect --quote=" " gnomad_bq_orielresearch.vdsToCsv_variants_csv_bq gs://***.csv

--autodetect: schema auto-detection
--quote=" ": to tolerate the double quotes in the values

A BigQuery table was generated with over 800 columns.
The table is not readable (see the sample of headers and values below). Do you have any idea how I can format it better, using PySpark SQL or bq?
Thanks for the help!
eilalan

Row  string_field_0       string_field_1   string_field_2
1    "{"v":{"contig":"Y"  "start":2710308  "ref":"GTTTT"
2    "{"v":{"contig":"Y"  "start":2829226  "ref":"A"
3    "{"v":{"contig":"Y"  "start":2829239  "ref":"A"


#14

I think Spark is probably messing up Hail’s formatting here. KeyTables have an export method, so you can do something like:

(ktv.annotate('js = json({v: v, va: va})')
    .select(['js'])
    .export('gs://data_gnomad_orielresearch/vdsToCsv_variants.csv'))

I’ve also made an issue to add a parallel option to keytable write. Currently, we write out in parallel and then concatenate the files in serial.


#15

Thank you, the export passed. The export function is more elegant than the DataFrame write command: it concatenates the files and allows one BigQuery call instead of several.

The bq load command generated an error (below). I haven’t explored it yet. It might be a schema issue or something else. Please let me know if you have any idea.

Thanks,
eilalan


bq load -F , --source_format CSV --skip_leading_rows=0 --max_bad_records=1 --autodetect --quote=" " gnomad_bq_orielresearch.vdsToCsv_variants_csv_bq gs://data_gnomad_orielresearch/vdsToCsv_variants.csv
Waiting on bqjob_rb2e63c316a01d78_0000015c0989111a_1 … (4s) Current status: DONE
BigQuery error in load operation: Error processing job ‘speedy-emissary-167213:bqjob_rb2e63c316a01d78_0000015c0989111a_1’: Too many errors encountered.
Failure details:

  • gs://data_gnomad_orielresearch/vdsToCsv_variants.csv: Too many
    values in row starting at position: 13658627.
  • gs://data_gnomad_orielresearch/vdsToCsv_variants.csv: Too many
    values in row starting at position: 14405543.

#16

I was able to resolve that with the flag --ignore_unknown_values (I can check these rows later). However, the table is still not readable (see below) and has no headers.
It looks like the data in the cells is correct… If there is no other choice, I can do some string manipulation.
Is there a way to add a header to the export command? Can I use the gnomAD VCF files and add more annotations like sample ID? Any other ideas?
thanks
eilalan

bq load -F , --source_format CSV --skip_leading_rows=0 --max_bad_records=3 --autodetect --quote=" " --ignore_unknown_values gnomad_bq_orielresearch.vdsToCsv_variants_csv_bq gs://data_gnomad_orielresearch/vdsToCsv_variants.csv
Waiting on bqjob_r6b6dfcdf839f0d3b_0000015c09920abc_1 … (23s) Current status: DONE


Row  string_field_0       string_field_1   string_field_2  string_field_3
1    "{"v":{"contig":"Y"  "start":2710308  "ref":"GTTTT"
2    "{"v":{"contig":"Y"  "start":2829226  "ref":"A"
3    "{"v":{"contig":"Y"  "start":2829239  "ref":"A"
4    "{"v":{"contig":"Y"  "start":2829170  "ref":"A"


#17

I’m not really sure what you mean about headers – this file only has one column, the json-ized blob of the variant and all the annotations. By default, kt.export will write columns separated by tabs, and writes a header with the column names.

Any idea what the “too many values” error means?
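
A possible explanation, offered as an assumption rather than a confirmed diagnosis: BigQuery reports "Too many values" when a row has more fields than the table schema, and a JSON blob split on commas yields a different number of fields per row depending on how many keys it contains. The sketch below shows the effect with two made-up records; the field names mirror the sample output earlier in the thread, the `altAlleles` contents are illustrative, and `field_count` is a hypothetical helper.

```python
import json

records = [
    {"v": {"contig": "Y", "start": 2710308, "ref": "GTTTT"}},
    {"v": {"contig": "Y", "start": 2829226, "ref": "A",
           "altAlleles": [{"ref": "A", "alt": "G"}]}},
]

# Compact JSON, one blob per row, as in a single-column export.
lines = [json.dumps(r, separators=(",", ":")) for r in records]


def field_count(line, delimiter):
    """Count the fields a naive delimiter-based parser would see."""
    return len(line.split(delimiter))


comma_counts = [field_count(line, ",") for line in lines]
tab_counts = [field_count(line, "\t") for line in lines]
```

With a comma delimiter the two rows disagree on field count; with a tab delimiter each row stays a single field.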


#18

Hi, some information and more questions below. I will explore it more and will let you know if I find anything.

By headers I mean header names like GENE, variant type, etc. The headers in the created BigQuery table are string_field_0 / 1 / 2 / …
I will check for ways to use the headers that were created by the export command.
bq load can take a schema as a parameter. Can I export the schema to a JSON file (or another format) to be used as input to bq load?

Any idea how to eliminate the extra characters, for example: "{"v":{"contig":"Y" - what is the column name and what is the cell value in this string?


My goal is to have a sample ID for every variant, so that I can run multivariate comparisons and multivariate interactive visualization of the data.
If it is easier to build it from the VCFs, I can try that too. I assume that the VCFs don’t include sample IDs. Do you know of a way to add sample IDs to these files?

I ran bq --format=prettyjson show -j to find out more about the errors (see below). There is not much information there.

google based search results for “too many values”


"status": {
  "errorResult": {
    "message": "Too many errors encountered.",
    "reason": "invalid"
  },
  "errors": [
    {
      "message": "Too many errors encountered.",
      "reason": "invalid"
    },
    {
      "location": "gs://data_gnomad_orielresearch/vdsToCsv_variants.csv",
      "message": "Too many values in row starting at position: 13658627.",
      "reason": "invalid"
    },
    {
      "location": "gs://data_gnomad_orielresearch/vdsToCsv_variants.csv",
      "message": "Too many values in row starting at position: 14405543.",
      "reason": "invalid"
    },
    {
      "location": "gs://data_gnomad_orielresearch/vdsToCsv_variants.csv",
      "message": "Too many values in row starting at position: 14434960.",
      "reason": "invalid"
    },
    {
      "location": "gs://data_gnomad_orielresearch/vdsToCsv_variants.csv",
      "message": "Too many values in row starting at position: 14539024.",
      "reason": "invalid"
    }
  ],
  "state": "DONE"
}
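
On passing a schema to bq load: it accepts a JSON schema file, which is a list of field definitions with `name`, `type`, and `mode`. The sketch below builds such a definition for a single string column named `js`, the column selected by the export expression in this thread. It is a minimal illustration, not output generated by Hail, and the choice of `NULLABLE` is an assumption.

```python
import json

# One STRING column named "js"; the mode is an assumption for this sketch.
schema = [
    {"name": "js", "type": "STRING", "mode": "NULLABLE"},
]

# Text that could be saved to a file and handed to bq load as the schema.
schema_text = json.dumps(schema, indent=2)
```

Since the export holds one JSON blob per row, a one-column schema like this only names the blob; splitting it into real columns such as gene or variant type would need a schema that mirrors the nested structure.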


#19

Hi,
I have made some progress.
Export to TSV is much cleaner (see the table snapshot below).
The header name is the name of the JSON column (js, see below). Is there a way to explicitly pass the schema as a parameter to the export command?
Thanks a lot,
eilalan

The following are my attempts at generating the schema, and their output:
print("key table generation")
ktv_select = ktv.annotate('js = json({v: v, va: va})').select(['js'])

print("key table schema")
print(ktv_select.schema)

print("Converts this key table to a Spark DataFrame and print.")
ktv_select.to_dadataframe().printSchema()

print("write to kt file")
ktv_select.write('gs://data_gnomad_orielresearch/kt_file.kt')

OUTPUT:
key table generation
key table schema
Struct {
js: String
}
Converts this key table to a Spark DataFrame and print.
Traceback (most recent call last):
  File "/tmp/095bfe8d-7867-4bab-add4-4247269ae8c1/convertVDSToCSV_variant.py", line 30, in <module>
    ktv_select.to_dadataframe().printSchema()
AttributeError: 'KeyTable' object has no attribute 'to_dadataframe'
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [095bfe8d-7867-4bab-add4-4247269ae8c1] entered state [ERROR] while waiting for [DONE].


TSV table:

Row  string_field_0  string_field_1  string_field_2  string_field_3           string_field_4
1    js              null            null            null                     null
2    {"v":{"con      ig":"Y","s      ar              ":4967108,"ref":"G","al  Alleles":[{"ref":"G","al
3    {"v":{"con      ig":"Y","s      ar              ":4968377,"ref":"A","al  Alleles":[{"ref":"A","al

print("export to tsv")
(ktv.annotate('js = json({v: v, va: va})')
    .select(['js'])
    .export('gs://data_gnomad_orielresearch/vdsToCsv_variants.tsv'))
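
A possible reading of the TSV table above, stated as an assumption since the load command used for the TSV is not shown: every letter `t` is missing from the keys (`con ig`, `s ar `, `al Alleles`), which is what would happen if the field delimiter were interpreted as the letter `t` instead of a tab character. The sketch below reproduces the effect on a made-up row; the `alt` value is illustrative.

```python
# A made-up row shaped like the exported JSON blob.
line = '{"v":{"contig":"Y","start":4967108,"ref":"G","altAlleles":[{"ref":"G","alt":"A"}]}}'

# Splitting on the letter "t" breaks the blob apart inside the key names.
fields_letter_t = line.split("t")

# Splitting on a real tab leaves the single-column row intact.
fields_tab = line.split("\t")
```

If this is the cause, the bq documentation describes `\t` or `tab` as accepted values for the field delimiter.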



#20

I think you mistyped to_dataframe as to_dadataframe. Does this resolve your issue?