I'm just wondering if you can specify CADD, gnomAD, ClinVar and dbNSFP options when annotating with Hail on dxjupyterlab_spark_cluster on the UKB RAP? From the Hail website, the following command can be used on your matrix table to annotate with these features:
db = hl.experimental.DB(region='us', cloud='gcp')
mt = db.annotate_rows_db(mt, 'CADD', 'clinvar_gene_summary', 'clinvar_variant_summary', 'dbNSFP_genes', 'dbNSFP_variants', 'dbSNP_rsid', 'gnomad_exome_sites')
weblink: Hail | Annotation Database
Unfortunately, this command does not work on hail when using the spark jupyterlab python3 console. The error that is given is:
Hail version: 0.2.78-b17627756568
Error summary: IOException: No FileSystem for scheme: gs
I know the EU version of this Hail command is not available, only the US version. Since the UKB RAP is based in London, is there any workaround for this?
Any help with this would be great.
Before we chase this down further, could you update to the latest version? 0.2.78 is pretty old.
The error means the version of Spark made available to you by DNAnexus lacks the Google Cloud Storage "hadoop connector." Moreover, AFAIK, DNAnexus is a thin wrapper around AWS. You won't be able to read data out of Google Cloud Storage into AWS without incurring significant cost. You should explicitly set the cloud to AWS:
db = hl.experimental.DB(region='us', cloud='aws')
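As a minimal sketch of the full call pointed at the AWS copy of the Annotation Database: the helper name `annotate_from_aws` and its default dataset list are only for illustration, while `DB` and `annotate_rows_db` are the calls from the question. Hail is imported inside the function, on the assumption that it is already installed on the cluster.

```python
def annotate_from_aws(mt, names=("CADD", "clinvar_variant_summary")):
    """Annotate the rows of `mt` from the AWS-hosted Annotation Database.

    `annotate_from_aws` is a hypothetical helper name; `names` should be
    dataset names that exist in the Annotation Database.
    """
    import hail as hl  # lazy import; assumes Hail is installed on the cluster

    # cloud='aws' asks Hail for the Amazon S3 copies of the datasets,
    # so the Google Cloud Storage connector is not needed.
    db = hl.experimental.DB(region="us", cloud="aws")

    # Each requested dataset is added as a row field under its own name.
    return db.annotate_rows_db(mt, *names)
```

Usage would look like `mt = annotate_from_aws(mt, ['CADD', 'gnomad_exome_sites'])`.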
Ah ok, and will this work around the fact that the UKB RAP uses London-based AWS services?
Also, Hail 0.2.78 is the only version of Hail provided by the UKB RAP afaik, unless there is a way of updating this manually on the RAP.
Ah, yeah, this is a good point. You’ll pay 0.02 USD per GB to stream the data out of our US East buckets to Europe. Most of the datasets you’re referencing aren’t that large, but it is a cost. If you’ll be using these datasets frequently, you can also copy them directly out of our S3 bucket into one you control in the London region. We ship a JSON file with the pip package which identifies the URL for every dataset. You can also see that file in our GitHub repository.
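To get a feel for when copying is worth it, here is a rough back-of-the-envelope sketch using the 0.02 USD per GB rate mentioned above. The function name and the 100 GB dataset size are made up for illustration, and the estimate ignores what you would pay to store the copy in your own bucket.

```python
def egress_cost_usd(size_gb, reads=1, usd_per_gb=0.02):
    """Estimate the cost of streaming `size_gb` of data across regions `reads` times."""
    return size_gb * reads * usd_per_gb


# Hypothetical 100 GB dataset: streamed on ten separate runs vs. copied once.
streaming = egress_cost_usd(100, reads=10)
one_copy = egress_cost_usd(100, reads=1)
print(f"stream 10x: {streaming:.2f} USD, copy once: {one_copy:.2f} USD")
```

Copying pays the transfer once, so it starts to win as soon as you read the same dataset more than once, storage costs aside.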
Ah cheers. It seems to run when 'aws' is specified. When exporting to a .tsv file, how do I specify the annotate_db columns to export? Previously, when exporting my VEP annotations, I used this code:
annotated_mt2.vep.export("file:///opt/notebooks/ukb23148_c1_b0_v1.annotate.tsv.gz", header=True, delimiter='\t')
Take a look at mt.describe(). Each dataset should be added under its own name. You can export any Hail Table as a TSV, so consider doing this:
ht = mt.rows()
ht.select('CADD', 'clinvar_gene_summary', ...).export('...', ...)
I’d also give a careful thought to why you need a TSV. You can store all this information in an efficient binary format, the Hail Table format:
ht = mt.rows()
Writing a TSV, in contrast, requires concatenating the exported pieces into one big TSV. This requires streaming through all the data on a single core. As such, it's somewhat slow.
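The snippet above stops at `mt.rows()`; writing the table would presumably use `Table.write`. Here is a minimal sketch of both routes side by side, assuming `mt` is the annotated MatrixTable. The helper name `save_row_annotations` and its parameters are hypothetical; `rows`, `select`, `write` and `export` are the standard Hail methods.

```python
def save_row_annotations(mt, fields, table_path, tsv_path=None):
    """Save selected row annotations of `mt` as a Hail Table, optionally also as a TSV.

    `save_row_annotations` is a hypothetical helper; `fields` are row field
    names such as 'CADD' that appear in mt.describe().
    """
    # Keep only the row key plus the requested annotation fields.
    ht = mt.rows().select(*fields)

    # The native Hail Table format is written in parallel, one file per partition.
    ht.write(table_path, overwrite=True)

    # The TSV is optional because it has to be assembled into a single file.
    if tsv_path is not None:
        ht.export(tsv_path, header=True, delimiter="\t")

    return ht
```

A table written this way can be loaded again later with `hl.read_table(table_path)`.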
Cheers for the info, I’m just trying to export this now as a VCF. I want to export as a VCF as I want to filter my variants on my own server and not on DNAnexus. In order to export as a vcf I had to manually select which annotations I would like in the info field using this code:
annotated_mt = annotated_mt.annotate_rows(info=annotated_mt.info.annotate(CADD_PHRED=annotated_mt.CADD["PHRED_score"]))
However, when I try exporting as a VCF I get the following error:
Error summary: VCFParseError: unexpected end of line
I can export a .tsv file from a Hail Table, but this does not contain the genotype information, so it makes variant filtering on DNAnexus a bit trickier. As well as this, some annotations did not go into the info field as simple strings or floats and could not be exported. Example:
annotated_mt = annotated_mt.annotate_rows(info=annotated_mt.info.annotate(gnomAD_pLI=annotated_mt.dbNSFP_genes["gnomAD_pLI"]))
Just wondering if there is a simpler way to export annotations in a VCF file for variant filtering.