Use case for Hail at our Institute

Dear Hail team,

I am evaluating the use case for Hail at our Institute of Genetic Medicine here at Newcastle. As part of this, I am trying to replicate your GWAS (https://github.com/Nealelab/UK_Biobank_GWAS/tree/master/imputed-v2-gwas) and gather various performance and usage indicators.
I understand you performed this on Google Cloud Dataproc, and to keep things simple I would like to do the same. Before I make a start, I would like to know about your experience:

  1.   How much did it cost (just the Google Cloud bill), and what Dataproc spec did you use to do this GWAS?
    
  2.   Apart from the very useful documentation on https://github.com/Nealelab, is there any other resource that would be useful?
    
  3.   Any other tips or advice?
    

Kind regards
Atif
Newcastle University

I can address these point by point:

  1. I think it cost about $4000 to do the v2 analysis, about $1 per phenotype. Because Hail has scaling constraints on the number of regressions it can perform simultaneously, the team ran on something like 10 concurrent clusters of ~2000 cores, each running some subset of phenotypes.
  2. The blog also has a lot of information.
  3. You should definitely read over the Hail documentation – there’s a lot of good information there, and trying to replicate the UK Biobank mega GWAS without some prior experience in Hail might be tough.
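
As a rough sanity check on the figures in point 1, the arithmetic can be sketched in Python. The inputs are the approximate numbers quoted above; the derived values (implied phenotype count, total cores, phenotypes per cluster) are back-of-envelope estimates rather than reported figures, and the even split across clusters is an assumption.

```python
# Approximate figures quoted in the reply above.
total_cost_usd = 4000
cost_per_phenotype_usd = 1
n_clusters = 10
cores_per_cluster = 2000

# Derived, back-of-envelope values (not reported by the team).
implied_phenotypes = total_cost_usd // cost_per_phenotype_usd
total_cores = n_clusters * cores_per_cluster
# Assumption: phenotypes were split evenly across the clusters.
phenotypes_per_cluster = implied_phenotypes // n_clusters

print(f"Implied phenotypes:      ~{implied_phenotypes}")
print(f"Total concurrent cores:  ~{total_cores}")
print(f"Phenotypes per cluster:  ~{phenotypes_per_cluster}")
```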

Many thanks, Tim. That is very useful.
Is the UKBB data still in Google Cloud? If so, is it possible to access it for my use-case study?

Which UKBB data? The GWAS summary results? Those can be downloaded from links in the blog: http://www.nealelab.is/uk-biobank

I mean the input UKBB data that was fed to Hail, both genotype and phenotype data.

On a separate note, I am currently attending a Broad GATK workshop here at Newcastle, and the Broad team has kindly given me access to a Hail tutorial on Terra, which is very helpful. I see that the “Terra Library” includes a UKBB dataset (https://app.terra.bio/#library/datasets), but I don’t have access to it. Is there a way I can get access here?

input UKBB data

This data is not public; you’ll need to apply for research use access.

Thanks, Tim. We have an approved application, but we are worried about the time and money involved in uploading the UKBB data to Google Cloud. I am assuming that you already have this sitting in a bucket and wanted to leverage that. Our end goal, in addition to evaluating the use case, is to test the reproducibility of what you (Broad) have done and to publish that work.

Also, do you know (ballpark) how much it cost and how long it took to upload the UKBB data? And what was the size of the UKBB data, please?

you already have this sitting in a bucket and wanted to leverage that

That’s true, but each access application has its own version of the data, so it’s not possible to share that.

I don’t know how long the upload took, but I think the BGEN is about 2 TB, which costs about $50 per month to store on the cloud.
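
For anyone budgeting a similar upload, here is a minimal Python sketch of the arithmetic. The 2 TB size and $50 per month are the approximate figures above; the per-GB price is only what those two numbers imply, not a quoted cloud rate. The 1 Gbit/s bandwidth is a purely hypothetical value to show how upload time scales, and both helper functions are illustrative names.

```python
def implied_price_per_gb(monthly_cost_usd: float, size_tb: float) -> float:
    """Monthly storage price per GB implied by a total cost and size (1 TB = 1000 GB)."""
    return monthly_cost_usd / (size_tb * 1000)


def upload_hours(size_tb: float, bandwidth_mbit_s: float) -> float:
    """Hours needed to upload size_tb at a sustained bandwidth in megabits per second."""
    size_megabits = size_tb * 1e6 * 8  # 1 TB = 1e6 MB, 8 bits per byte
    return size_megabits / bandwidth_mbit_s / 3600


# Figures from the thread: about 2 TB of BGEN, about $50 per month.
price = implied_price_per_gb(monthly_cost_usd=50, size_tb=2)
# Hypothetical sustained bandwidth of 1 Gbit/s; real throughput will differ.
hours = upload_hours(size_tb=2, bandwidth_mbit_s=1000)

print(f"Implied storage price: ${price:.3f} per GB per month")
print(f"Upload time at 1 Gbit/s: {hours:.1f} hours")
```

At a slower sustained rate the time scales inversely, so 100 Mbit/s would take ten times as long.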