Annotating with CADD, gnomAD, ClinVar & dbNSFP on UKB RAP

I'm just wondering if you can specify CADD, gnomAD, ClinVar and dbNSFP options when annotating with Hail on dxjupyterlab_spark_cluster on the UKB RAP? From the Hail website, the following command can be used on your MatrixTable to annotate with these features:

db = hl.experimental.DB(region='us', cloud='gcp')

mt = db.annotate_rows_db(mt, 'CADD', 'clinvar_gene_summary', 'clinvar_variant_summary', 'dbNSFP_genes', 'dbNSFP_variants', 'dbSNP_rsid', 'gnomad_exome_sites')

weblink: Hail | Annotation Database

Unfortunately, this command does not work in Hail when using the Spark JupyterLab Python 3 console. The error given is:

Hail version: 0.2.78-b17627756568

Error summary: IOException: No FileSystem for scheme: gs

I know the EU version of this Hail annotation database is not available, but the US version is. Since the UKB RAP is based in London, is there any workaround for this?

Any help with this would be great.

Before we chase this down further, could you update to the latest version? 0.2.78 is pretty old.

This means the version of Spark made available to you by DNAnexus lacks the Google Cloud Storage Hadoop connector. Moreover, AFAIK, DNAnexus is a thin wrapper around AWS. You won't be able to read data out of Google Cloud Storage into AWS without incurring significant cost. You should explicitly set the cloud to aws:

db = hl.experimental.DB(region='us', cloud='aws')

Ah ok, and will this work around the fact that the UKB RAP uses London-based AWS services?

Also, Hail 0.2.78 is the only version of Hail provided by the UKB RAP, AFAIK, unless there is a way of updating it manually on the RAP.

Ah, yeah, this is a good point. You’ll pay 0.02 USD per GB to stream the data out of our US East buckets to Europe. Most of the datasets you’re referencing aren’t that large, but it is a cost. If you’ll be using these datasets frequently, you can also copy them directly out of our S3 bucket into one you control in the London region. We ship a JSON file with the pip package which identifies the URL for every dataset. You can also see that file in our GitHub repository.
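If it helps with budgeting, here's a minimal sketch (plain Python) of estimating that one-time egress cost. The dataset names and sizes below are made-up placeholders, not values from Hail's config JSON; look up the real sizes before deciding whether to stream or copy.

```python
# Hypothetical sketch: estimating the egress cost of copying annotation
# datasets from Hail's US East S3 bucket to a bucket in the London region.

EGRESS_USD_PER_GB = 0.02  # the inter-region transfer price quoted above


def egress_cost_usd(dataset_sizes_gb):
    """Return the total transfer cost in USD for the given dataset sizes (GB)."""
    return sum(dataset_sizes_gb.values()) * EGRESS_USD_PER_GB


# Placeholder sizes -- replace with the real dataset sizes.
sizes_gb = {
    "CADD": 3.0,
    "clinvar_variant_summary": 0.5,
    "gnomad_exome_sites": 50.0,
}

print(f"Estimated one-time egress: ${egress_cost_usd(sizes_gb):.2f}")
```

A one-time copy like this pays the egress cost once, whereas streaming the data on every annotation run pays it repeatedly.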

Ah cheers. It seems to run when 'aws' is specified. Just wondering: when exporting to a .tsv file, how do I specify the annotate_rows_db columns to export? Previously, when exporting my VEP annotations, I used this code:

annotated_mt2.vep.export("file:///opt/notebooks/ukb23148_c1_b0_v1.annotate.tsv.gz", header=True, delimiter='\t')

Take a look at mt.describe(). Each dataset should be added under its own name. You can export any Hail Table as a TSV, so consider doing this:

ht = mt.rows()
ht.select('CADD', 'clinvar_gene_summary', ...).export('...', ...)

I’d also give a careful thought to why you need a TSV. You can store all this information in an efficient binary format, the Hail Table format:

ht = mt.rows()
ht.select(...).write('...')

Writing a TSV, in contrast, requires concatenating the per-partition output into one big TSV. That means streaming through all the data on a single core, so it's somewhat slow.
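Conceptually, that final step looks like the sketch below (plain Python, not Hail or Spark internals): each partition produces its own shard of lines, and one writer streams through every shard to produce the single file, writing the header exactly once.

```python
# Illustrative sketch of why a single-file TSV export is a serial step:
# merging per-partition shards happens in one pass on one core.
import io


def concatenate_shards(shards, header):
    """Merge per-partition TSV shards into one TSV string, header written once."""
    out = io.StringIO()
    out.write(header + "\n")
    for shard in shards:        # every shard flows through this one loop
        for line in shard:
            out.write(line + "\n")
    return out.getvalue()


# Two hypothetical partitions of locus/score rows.
shards = [["1:100\t12.3"], ["1:200\t4.5", "1:300\t0.7"]]
print(concatenate_shards(shards, header="locus\tCADD"))
```

The partitioned Hail Table write avoids this bottleneck because each partition is written independently in parallel.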