Annotation database


#1

Hi,
I’m trying to use the annotation database with the example 1kg VDS file, but I keep getting errors like the one below:

Job [c0475ca9-27ad-4e1a-833e-622e8ddeceef] submitted.
Waiting for job output...
Running on Apache Spark version 2.0.2
SparkUI available at http://10.132.0.2:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.1-0d9e264
2017-09-08 22:45:54 Hail: WARN: called redundant split on an already split VDS
Traceback (most recent call last):
  File "/tmp/c0475ca9-27ad-4e1a-833e-622e8ddeceef/annot.py", line 10, in <module>
    'va.gencode19'
  File "/home/ec2-user/BuildAgent/work/4d93753832b3428a/python/hail/dataset.py", line 997, in annotate_variants_db
sqlite3.OperationalError: too many terms in compound SELECT
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [c0475ca9-27ad-4e1a-833e-622e8ddeceef] entered state [ERROR] while waiting for [DONE].

I’ve tried with both this file and a sites-only VDS generated from a fairly generic variant list. Any advice would be much appreciated.
Thanks


#2

What was the list of annotations supplied to the method? I can sort of see the failure mode, but a little more detail would help us pin it down.
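
For anyone who lands here with the same message: "too many terms in compound SELECT" is the error SQLite raises when a single statement chains more SELECT terms (with UNION, UNION ALL, and so on) than the library's compile-time limit, which defaults to 500. The sketch below only reproduces that message with Python's standard sqlite3 module and shows one generic workaround, splitting the query into batches. It is not Hail's code: the helper names and the batch size of 100 are made up for illustration, and whether annotate_variants_db actually builds its query this way is an assumption on my part.

```python
import sqlite3


def compound_select(conn, values):
    """Fetch the given integers with one compound SELECT (one term per value)."""
    # Integer literals are inlined so only the compound-term limit is in play.
    sql = " UNION ALL ".join("SELECT {}".format(int(v)) for v in values)
    return [row[0] for row in conn.execute(sql)]


def try_compound_select(conn, values):
    """Return None if the query runs, otherwise the SQLite error message."""
    try:
        compound_select(conn, values)
        return None
    except sqlite3.OperationalError as exc:
        return str(exc)


def batched_select(conn, values, batch_size=100):
    """Hypothetical workaround: issue several small compound SELECTs."""
    results = []
    for start in range(0, len(values), batch_size):
        results.extend(compound_select(conn, values[start:start + batch_size]))
    return results


conn = sqlite3.connect(":memory:")
values = list(range(1000))

# On a build with the default limit this prints the error message;
# a build compiled with a higher limit prints None instead.
print(try_compound_select(conn, values))

# The batched version stays under the limit and returns every value.
print(len(batched_select(conn, values)))
```

If the single-statement call fails and the batched one succeeds, the number of terms in the generated query is the likely culprit rather than the data itself.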

Thanks, Jules!


#3

Ah yes, sorry for not including that. The annotations requested were quite simple, e.g.
sites_vds.annotate_variants_db([
    'va.cadd',
    'va.fantom5'
])
The error about too many terms was pretty consistent across a few different attempts.
thanks!


#4

Hey @Jules_Maller, I can’t reproduce this.

I started a cluster using Liam’s cloud tools:

cluster start dk-sql-test

then I connected to a notebook

cluster connect dk-sql-test notebook

then I executed this Python script:

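from hail import *
hc = HailContext()
# Simulate a small dataset: 3 populations, 100 samples, 100 variants
vds = hc.balding_nichols_model(3, 100, 100)
# Request the same two annotations from the annotation database
vds2 = vds.annotate_variants_db([
    'va.cadd',
    'va.fantom5'
])
vds2.count()
vds2.variant_schema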

which outputs:

Struct{ancestralAF:Double,AF:Array[Double],fantom5:Struct{robust:Boolean,permissive:Boolean},cadd:Struct{RawScore:Double,PHRED:Double}}

which is what I expect.

Can you give some more details about how you’re creating your cluster, such as what type of machines and OSes you’re using? Is this a vanilla dataproc cluster?


#5

That worked, thanks! No idea what was going wrong before, but this path is fine for what I wanted to do, so I’m not too bothered.