Use-case for hail at our Institute

atifkhanncl · June 6, 2019, 5:30am

Dear Hail team,

I am evaluating the use case for Hail at our Institute of Genetic Medicine here at Newcastle. As part of this, I am trying to replicate your GWAS (https://github.com/Nealelab/UK_Biobank_GWAS/tree/master/imputed-v2-gwas) and gather various performance and usage indicators.
I understand you performed this on Google Cloud Dataproc, and to keep things simple I would like to do the same. Before I make a start, I would like to know about your experience:

  How much did it cost (just the Google Cloud bill), and what Dataproc spec did you use to do this GWAS?

  Apart from the very useful documentation at https://github.com/Nealelab, is there any other resource that would be useful?

  Any other tips or advice?

Kind regards
Atif
Newcastle University

tpoterba · June 6, 2019, 11:53am

I can address point-by-point -

I think it cost about $4000 to do the v2 analysis, about $1 per phenotype. Because Hail is limited in how many regressions it can perform simultaneously, the team ran on something like 10 concurrent clusters of ~2000 cores, each running some subset of the phenotypes.
The blog also has a lot of information.
You should definitely read over the Hail documentation – there’s a lot of good information there, and trying to replicate the UK Biobank mega GWAS without some prior experience in Hail might be tough.
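
For planning purposes, the rough figures in the first point can be turned into a back-of-envelope estimate. This is only a sketch built from the approximate numbers quoted above; the even split of phenotypes across clusters is an assumption made for illustration, not something stated in the thread.

```python
# Back-of-envelope figures implied by the approximate numbers quoted above.
total_cost_usd = 4000        # "about $4000" for the v2 analysis
cost_per_phenotype_usd = 1   # "about $1 per phenotype"
n_clusters = 10              # "something like 10 concurrent clusters"
cores_per_cluster = 2000     # "~2000 cores" per cluster

# Number of phenotypes implied by the total and per-phenotype cost.
implied_phenotypes = total_cost_usd / cost_per_phenotype_usd

# ASSUMPTION: phenotypes were split evenly across clusters.
phenotypes_per_cluster = implied_phenotypes / n_clusters

# Peak core count if all clusters ran at the same time.
total_cores = n_clusters * cores_per_cluster

print(f"Implied phenotypes:     ~{implied_phenotypes:.0f}")
print(f"Phenotypes per cluster: ~{phenotypes_per_cluster:.0f}")
print(f"Peak concurrent cores:  ~{total_cores}")
```

Changing `total_cost_usd` or `n_clusters` gives a quick feel for how a smaller replication (fewer phenotypes, fewer clusters) would scale under the same per-phenotype cost.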

atifkhanncl · June 20, 2019, 11:28am

Many thanks, Tim. That is very useful.
Is the UKBB data still in Google Cloud? If so, is it possible to access it for my use-case study?

tpoterba · June 20, 2019, 1:34pm

Which UKBB data? The GWAS summary results? Those can be downloaded from links in the blog: http://www.nealelab.is/uk-biobank

atifkhanncl · June 20, 2019, 1:44pm

I mean the input UKBB data that was fed to Hail, both genotype and phenotype data.
On a separate note, I am currently attending a Broad GATK workshop here at Newcastle, and the Broad team has kindly given me access to a Hail tutorial on Terra, which is very helpful. I see that the “Terra Library” includes a UKBB dataset (https://app.terra.bio/#library/datasets), but I don’t have access to it. Is there a way I can get access here?

tpoterba · June 20, 2019, 4:12pm

input UKBB data

This data is not public; you’ll need to apply for research use access.

atifkhanncl · June 20, 2019, 4:25pm

Thanks, Tim. We have an approved application, but we are worried about the time and money involved in uploading the UKBB data to Google Cloud. I assumed that you already have this sitting in a bucket and wanted to leverage that. Our end goal, in addition to evaluating the use case, is to test the reproducibility of what you (Broad) have done and publish that work.

Also, do you know (ballpark) how much it cost and how long it took to upload the UKBB data? And what was the size of the UKBB data?

tpoterba · June 20, 2019, 10:11pm

you already have this sitting in a bucket and wanted to leverage that

That’s true, but each access application has its own version of the data, so it’s not possible to share that.

I don’t know how long the upload took, but I think the BGEN is about 2 TB, which costs about $50 per month to store on the cloud.
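
As a rough illustration of that storage figure, the sketch below derives the per-GB rate implied by "about 2 TB" and "about $50 per month" and projects it over a longer period. The rate is back-calculated from those two approximate numbers, not taken from a price list, and the decimal TB-to-GB conversion and the 12-month horizon are assumptions for the example.

```python
def storage_cost_usd(size_gb: float, usd_per_gb_month: float, months: int = 1) -> float:
    """Flat-rate storage cost: size times monthly rate times number of months."""
    return size_gb * usd_per_gb_month * months

# Approximate figures quoted above.
bgen_size_gb = 2 * 1000      # ASSUMPTION: "about 2 TB" read as 2000 GB (decimal)
quoted_monthly_usd = 50      # "about $50 per month"

# Per-GB monthly rate implied by the two quoted figures.
implied_rate = quoted_monthly_usd / bgen_size_gb

# Projected cost of keeping the BGEN in a bucket for a year (hypothetical horizon).
yearly_cost = storage_cost_usd(bgen_size_gb, implied_rate, months=12)

print(f"Implied rate: ${implied_rate:.3f} per GB per month")
print(f"12-month storage: ~${yearly_cost:.0f}")
```

Actual cloud pricing varies by region and storage class, so the implied rate should be replaced with a current quote before budgeting.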
