I’m trying to use the annotation database with the example 1kg VDS file, but I keep getting errors like the following:
Job [c0475ca9-27ad-4e1a-833e-622e8ddeceef] submitted.
Waiting for job output...
Running on Apache Spark version 2.0.2
SparkUI available at http://10.132.0.2:4040
  __  __     <>__
 / /_/ /__  __/ /
/ __  / _ `/ / /
/_/ /_/\_,_/_/_/   version 0.1-0d9e264
2017-09-08 22:45:54 Hail: WARN: called redundant split on an already split VDS
Traceback (most recent call last):
File "/tmp/c0475ca9-27ad-4e1a-833e-622e8ddeceef/annot.py", line 10, in <module>
File "/home/ec2-user/BuildAgent/work/4d93753832b3428a/python/hail/dataset.py", line 997, in annotate_variants_db
sqlite3.OperationalError: too many terms in compound SELECT
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [c0475ca9-27ad-4e1a-833e-622e8ddeceef] entered state [ERROR] while
waiting for [DONE].
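For context on the error message itself (this is speculation from the message alone, not a statement about Hail’s internals): “too many terms in compound SELECT” is SQLite’s own error when a query chains more SELECT statements with UNION/UNION ALL than its compile-time limit allows (SQLITE_MAX_COMPOUND_SELECT, 500 by default). A minimal sketch, assuming a default SQLite build:

```python
import sqlite3

# Minimal reproduction of the failure mode suggested by the traceback:
# SQLite caps the number of SELECT terms that UNION/UNION ALL may combine
# (compile-time limit SQLITE_MAX_COMPOUND_SELECT, 500 by default). A
# generated query exceeding that cap fails with exactly this message.
conn = sqlite3.connect(":memory:")
query = " UNION ALL ".join("SELECT 1" for _ in range(600))  # 600 > 500

try:
    conn.execute(query)
    error = None
except sqlite3.OperationalError as exc:
    error = str(exc)

print(error)  # → too many terms in compound SELECT
```

So if annotate_variants_db happened to batch its lookups as one SELECT per term, a long enough list could trip this limit; again, that’s a guess from the message, not a claim about the implementation.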
I’ve tried both with this file and with a sites-only VDS generated from a fairly generic variant list. Any advice would be much appreciated.
What was the list of annotations supplied to the method? I can sort of see the failure mode, but a little more detail would help us pin it down.
Ah yes, sorry not to have included that. The annotations requested were quite simple, e.g.
The error about too many terms was pretty consistent across a few different attempts.
Hey @Jules_Maller, I can’t reproduce this.
I started a cluster using Liam’s cloud tools:
cluster start dk-sql-test
then I connected to a notebook
cluster connect dk-sql-test notebook
then I executed this Python script:
from hail import *
hc = HailContext()
vds = hc.balding_nichols_model(3,100,100)
vds2 = vds.annotate_variants_db([
which is what I expect.
Can you give some more details about how you’re creating your cluster, such as what type of machines and OSes you’re using? Is this a vanilla dataproc cluster?
That worked! Thanks. No idea what was going wrong before, but this is a fine path for what I wanted to do, so I’m not too bothered. Thanks again!