Annotating with CADD, gnomad, Clinvar & dbNSFP on UKB RAP

dint · May 9, 2022, 1:33pm

I’m just wondering if you can specify CADD, gnomAD, ClinVar and dbNSFP options when annotating with Hail on dxjupyterlab_spark_cluster on the UKB RAP? According to the Hail website, the following command can be used on your MatrixTable to annotate with these features:

db = hl.experimental.DB(region='us', cloud='gcp')

mt = db.annotate_rows_db(mt, 'CADD', 'clinvar_gene_summary', 'clinvar_variant_summary', 'dbNSFP_genes', 'dbNSFP_variants', 'dbSNP_rsid', 'gnomad_exome_sites')

weblink: Hail | Annotation Database

Unfortunately, this command does not work in Hail when using the Spark JupyterLab Python 3 console. The error given is:

Hail version: 0.2.78-b17627756568

Error summary: IOException: No FileSystem for scheme: gs

I know the EU version of this hail command is not available but it is available on the US version. Since the UKB RAP is based in London, is there any workaround for this?

Any help with this would be great.

tpoterba · May 9, 2022, 1:34pm

Before we chase this down further, could you update to the latest version? 0.2.78 is pretty old.

danking · May 9, 2022, 1:47pm

This means the version of Spark made available to you by DNAnexus lacks the Google Cloud Storage “hadoop connector.” Moreover, AFAIK, DNAnexus is a thin wrapper around AWS. You won’t be able to read data out of Google Cloud Storage into AWS without incurring significant cost. You should explicitly set the cloud to aws:

db = hl.experimental.DB(region='us', cloud='aws')

dint · May 9, 2022, 1:51pm

Ah ok, and will this work around the fact that the UKB RAP uses London-based AWS services?

Also, Hail 0.2.78 is the only version of Hail provided by the UKB RAP afaik, unless there is a way of updating this manually on the RAP.

danking · May 9, 2022, 2:17pm

Ah, yeah, this is a good point. You’ll pay 0.02 USD per GB to stream the data out of our US East buckets to Europe. Most of the datasets you’re referencing aren’t that large, but it is a cost. If you’ll be using these datasets frequently, you can also copy them directly out of our S3 bucket into one you control in the London region. We ship a JSON file with the pip package which identifies the URL for every dataset. You can also see that file in our GitHub repository.
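To make the "copy them into your own bucket" route a bit more concrete, here is a minimal sketch of pulling the AWS URLs for a few datasets out of that JSON file. The layout it assumes (dataset name, then a `versions` list, then `url` keyed by cloud and region) is an assumption about the shipped file, so check it against your installed copy. The `SAMPLE_JSON` excerpt and its bucket URLs are placeholders, and `dataset_urls` is a hypothetical helper, not part of Hail.

```python
import json

# Hypothetical excerpt. The nesting (dataset -> versions -> url -> cloud -> region)
# is an assumption about the JSON shipped with Hail; the URLs are placeholders.
SAMPLE_JSON = """
{
  "CADD": {
    "versions": [
      {
        "reference_genome": "GRCh37",
        "version": "1.4",
        "url": {
          "aws": {"us": "s3://example-bucket/CADD.GRCh37.ht"},
          "gcp": {"us": "gs://example-bucket/CADD.GRCh37.ht"}
        }
      }
    ]
  }
}
"""


def dataset_urls(config, names, cloud="aws", region="us"):
    """Map (name, version, reference_genome) to the URL for one cloud/region.

    Datasets or versions without an entry for that cloud/region are skipped.
    """
    urls = {}
    for name in names:
        for entry in config.get(name, {}).get("versions", []):
            url = entry.get("url", {}).get(cloud, {}).get(region)
            if url is not None:
                key = (name, entry.get("version"), entry.get("reference_genome"))
                urls[key] = url
    return urls


config = json.loads(SAMPLE_JSON)
for key, url in dataset_urls(config, ["CADD"]).items():
    print(key, url)
```

With the real file loaded in place of `SAMPLE_JSON`, the resulting URLs are what you would copy into a London-region bucket.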

dint · May 9, 2022, 3:08pm

Ah cheers. It seems to run when 'aws' is specified. Just wondering, when exporting to a .tsv file, how do I specify exporting the annotate_rows_db columns? Before, when I was exporting my VEP annotations, I used this code:
annotated_mt2.vep.export("file:///opt/notebooks/ukb23148_c1_b0_v1.annotate.tsv.gz", header=True, delimiter='\t')

danking · May 9, 2022, 3:37pm

Take a look at mt.describe(). Each dataset should be added under its own name. You can export any Hail Table as a TSV, so consider doing this:

ht = mt.rows()
ht.select('CADD', 'clinvar_gene_summary', ...).export('...', ...)

I’d also give a careful thought to why you need a TSV. You can store all this information in an efficient binary format, the Hail Table format:

ht = mt.rows()
ht.select(...).write('...')

Writing a TSV, in contrast, requires concatenation of the TSV into one big TSV. This requires streaming through all the data on a single core. As such, it’s somewhat slow.
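The two options above, plus a sharded TSV variant, can be sketched as one small helper. This is only a sketch: `save_annotations` and its `fmt` argument are hypothetical names, and `ht` is assumed to be a Hail Table such as `mt.rows()`. The `parallel='header_per_shard'` option of `Table.export` was not mentioned in this thread; it writes a folder of shards instead of one file, which avoids the single-core concatenation step.

```python
def save_annotations(ht, fields, path, fmt="ht"):
    """Select `fields` from a Hail Table and save them to `path`.

    fmt="ht"         : native Hail Table format, written in parallel
    fmt="tsv"        : one TSV file (shards are concatenated on a single core)
    fmt="tsv_shards" : a folder of TSV shards, each with its own header
    """
    selected = ht.select(*fields)
    if fmt == "ht":
        selected.write(path, overwrite=True)
    elif fmt == "tsv":
        selected.export(path, header=True, delimiter="\t")
    elif fmt == "tsv_shards":
        selected.export(path, header=True, delimiter="\t",
                        parallel="header_per_shard")
    else:
        raise ValueError(f"unknown fmt: {fmt!r}")
    return selected
```

For example, `save_annotations(mt.rows(), ['CADD', 'clinvar_gene_summary'], 'file:///opt/notebooks/annotations.ht')` would write the selected row annotations as a Hail Table (the output path here is a made-up example).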

dint · May 25, 2022, 1:56pm

Hi Dan,

Cheers for the info. I’m now trying to export this as a VCF, as I want to filter my variants on my own server and not on DNAnexus. In order to export as a VCF, I had to manually select which annotations I would like in the info field using this code:
annotated_mt = annotated_mt.annotate_rows(info=annotated_mt.info.annotate(CADD_PHRED=annotated_mt.CADD["PHRED_score"]))

However, when I try exporting as a VCF I get the following error:
Error summary: VCFParseError: unexpected end of line

I can export as a .tsv file from a Hail Table, but this does not contain the genotype information, so it makes variant filtering on DNAnexus a bit trickier. As well as this, some annotations were not added to the info field as simple strings or floats and could not be exported. Example:

annotated_mt = annotated_mt.annotate_rows(info=annotated_mt.info.annotate(gnomAD_pLI=annotated_mt.dbNSFP_genes["gnomAD_pLI"]))

Just wondering if there is a simpler way to export annotations in a VCF file for variant filtering.
