How to run GWAS from UK Biobank efficiently on Hail

Hi, thanks a lot for developing Hail. I have successfully run a GWAS with more than 10M variants and 300K samples from UK Biobank using Hail. However, it takes about 3 hours to run chromosome 1, and a total of 48 hours was needed to complete the full analysis. When I used fastGWA, provided by GCTA, to run my GWAS, it took only 7 hours. I wonder whether there is a more efficient way to run such a large GWAS on Hail?


Hi @chenll0105,

I’m sorry to hear you’re experiencing poor performance with Hail. It sounds like you’re running a genome-wide linear mixed model, not a simple genome-wide linear regression. Hail’s linear mixed models are currently not as fast as other mixed-model tools. You might also investigate SAIGE.

If you’re not running a linear mixed model, I need more information before I can help you. Are you running Hail on a cluster or on a single computer? In either case, which cloud provider are you using, or is it on-premises?

A common performance problem is not using fast linear algebra libraries. Did you install the BLAS and LAPACK libraries on every worker node?

Hi, @danking. Thanks a lot for your kind reply. We were running Hail on a single server, which has 52 cores, 196 GB of memory, and SATA3 mechanical hard disks in RAID 5. The server is not in any cloud. We actually ran Hail for linear regression (hl.linear_regression_rows()), not the linear mixed model.

Below is the logging output when initializing Hail:

hl.init()
20/05/08 11:09:10 WARN Utils: Your hostname, sugon resolves to a loopback address: 127.0.1.1; using 192.168.3.32 instead (on interface eno1)
20/05/08 11:09:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/05/08 11:09:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/05/08 11:09:11 WARN Hail: This Hail JAR was compiled for Spark 2.4.0, running with Spark 2.4.1.
Compatibility is not guaranteed.
Running on Apache Spark version 2.4.1
SparkUI available at http://sugon:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.36-ed011219dd93
LOGGING: writing to /data/chenll/UKB41476/nap/test/imp_data/hail-20200508-1109-0.2.36-ed011219dd93.log

Can you share the script you ran? Also, did you confirm that you have OpenBLAS installed? The log statement that reports this occurs much later, during the regression, so it won’t appear in the startup output above.

I have confirmed that OpenBLAS and LAPACK were installed. Below is my script:

import hail as hl
import pandas as pd
import os
import sys

pwd = '/test/imp_data/'
os.chdir(pwd)
chr = sys.argv[1]
data_path = '/data/raw/UKB-41476/'
bgenFile = data_path+'imp_data/ukb_imp_chr'+chr+'_v3.bgen'
sampleFile = data_path+'imp_data/ukb53562_imp_chr'+chr+'_v3_s487296.sample'
mfiFile = data_path+'imp_data/ukb_mfi_chr'+chr+'_v3.txt'

# initiate hail environment
hl.init(min_block_size=128)

# import phenotype info
sample_info = hl.import_table(
    pwd+'/target_pheno_1190.csv',
    key='f_eid',
    types={
        'f_eid':hl.tstr, 'female':hl.tbool, 'age':hl.tfloat64,
        'f_1190_0_0':hl.tfloat64, 'shift_worker':hl.tbool,
        'PC1':hl.tfloat64, 'PC2':hl.tfloat64,
        'PC3':hl.tfloat64, 'PC4':hl.tfloat64
    },
    delimiter=','
)

mfi_info = hl.import_table(
    mfiFile,
    no_header=True,
    key='f0',
    types={
        'f0':hl.tstr, 'f1':hl.tstr, 'f2':hl.tint,
        'f3':hl.tstr, 'f4':hl.tstr, 'f5':hl.tfloat64,
        'f6':hl.tstr, 'f7':hl.tfloat64
    }
)

mfi_info = mfi_info.drop('f1','f3','f4','f5','f6')  # drop fields from the table
mfi_info = mfi_info.rename({'f0':'variant','f2':'bp','f7':'INFO'})  # rename the fields of a Table

# import genotype information
genotype_info = hl.import_bgen(
    path=bgenFile,
    entry_fields=['GT','GP','dosage'],
    sample_file=sampleFile
)

# add mfi information to variants
data = genotype_info.annotate_rows(mfi=mfi_info[genotype_info.row.varid])
data = data.filter_rows(data.mfi.INFO>0.8)

# variant qc to get allele frequency
data = hl.variant_qc(data)

data = data.filter_rows(
    (data.variant_qc.AF[1]>=0.001)
    & (data.variant_qc.p_value_hwe>1e-10)  # filter by HWE
)

# add phenotype info to MatrixTable columns
data = data.annotate_cols(pheno=sample_info[data.col.s])

# filter out NA samples
data = data.filter_cols(~hl.is_nan(data.pheno.f_1190_0_0))

# run association analysis using linear regression
gwas = hl.linear_regression_rows(
    y=data.pheno.f_1190_0_0,
    x=data.GT.n_alt_alleles(),
    covariates=[1.0,
        data.pheno.female,  # a binary variable
        data.pheno.shift_worker,  # a binary variable
        data.pheno.age,
        data.pheno.PC1,
        data.pheno.PC2,
        data.pheno.PC3,
        data.pheno.PC4
    ]
)

# add allele frequency and rsid to gwas result
gwas = gwas.join(
    data.rows(),
    how='inner'
)

gwas = gwas.key_by('varid')

gwas = gwas.annotate(chr=hl.str(chr))
gwas = gwas.select(
    'n', 'beta',
    'standard_error',
    'p_value', 'chr',
    bp=hl.str(gwas.mfi.bp),
    oa=gwas.alleles[0],
    ea=gwas.alleles[1],
    eaf=gwas.variant_qc.AF[1]
)

gwas = gwas.annotate(variant=gwas.chr+':'+gwas.bp+':'+gwas.oa+':'+gwas.ea)
gwas = gwas.key_by('variant')
gwas = gwas.select(
    'chr','bp','ea','oa','eaf','n',
    'beta','standard_error','p_value'
)

gwas.write('out_path/nap_chr'+chr+'_gwas.ht')
hl.read_table('out_path/nap_chr'+chr+'_gwas.ht').export('out_path/nap_chr'+chr+'_gwas.tsv.bgz', header=True, delimiter='\t')
hl.stop()

Great, there are a few issues here that we can easily address.

First, remove this line:

gwas = gwas.key_by('varid')

That reorders the entire dataset to be keyed by RSID, an expensive and unnecessary operation.

The BGEN file is imported with all three entry fields, GT, GP, and dosage, but you only ever use GT. Change the import to list only the fields you need. Your current version is:

genotype_info = hl.import_bgen(
    path=bgenFile,
    entry_fields=['GT','GP','dosage'],
    sample_file=sampleFile
)
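
Since only GT is used downstream, a version requesting just that field might look like this:

genotype_info = hl.import_bgen(
    path=bgenFile,
    entry_fields=['GT'],  # request only the field the regression actually uses
    sample_file=sampleFile
)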

Moreover, you perform a GWAS on the GT field:

hl.linear_regression_rows(
y=data.pheno.f_1190_0_0,
x=data.GT.n_alt_alleles(),
...
)

But BGEN files store probabilities (GP). In order to do a GWAS of the GT field, Hail has to calculate the GT from the probabilities, which is an expensive operation to do for every single genotype! Are you sure you do not want to run a linear regression of the dosage? Or perhaps one of the GP probabilities?
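
For example, a sketch of the same regression on dosage; this assumes 'dosage' was requested in entry_fields when importing the BGEN:

gwas = hl.linear_regression_rows(
    y=data.pheno.f_1190_0_0,
    x=data.dosage,  # use the stored dosage directly; no GT computation needed
    covariates=[1.0, data.pheno.female, data.pheno.shift_worker,
                data.pheno.age, data.pheno.PC1, data.pheno.PC2,
                data.pheno.PC3, data.pheno.PC4]
)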

On the line below, you perform a join between mfi_info and genotype_info on varid. BGEN files are ordered by locus and alleles, not by varid, so this operation has to re-order your entire mfi_info file to be ordered by locus and alleles. Is it possible to change this file so that it is sorted by locus and alleles rather than varid? That will substantially improve performance.

data=genotype_info.annotate_rows(mfi=mfi_info[genotype_info.row.varid])
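
Once mfi_info is keyed by locus and alleles (re-keying it is discussed further down this thread), the annotation can use the BGEN’s own row key, and no re-ordering of the genotype data is needed:

data = genotype_info.annotate_rows(mfi=mfi_info[genotype_info.row_key])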

Next, don’t do the following join; it causes Hail to read the BGEN data twice. I think you’ll benefit from reading this guide; in particular, note that Hail is lazy. Instead of the join, use the pass_through argument to linear_regression_rows.

gwas = gwas.join(
    data.rows(),
    how='inner'
)
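
A sketch of the pass_through version, using the field names from your script; the requested row fields are copied into the regression output, so the join, and the second read of the BGEN, disappears:

gwas = hl.linear_regression_rows(
    y=data.pheno.f_1190_0_0,
    x=data.GT.n_alt_alleles(),
    covariates=[1.0, data.pheno.female, data.pheno.shift_worker,
                data.pheno.age, data.pheno.PC1, data.pheno.PC2,
                data.pheno.PC3, data.pheno.PC4],
    pass_through=[data.variant_qc.AF, data.mfi.bp]  # carried into the output rows unchanged
)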

You almost certainly do not want to do the following. Your data is more useful when keyed by locus and alleles instead of by this constructed string. Skip this step.

gwas = gwas.annotate(variant=gwas.chr+':'+gwas.bp+':'+gwas.oa+':'+gwas.ea)
gwas = gwas.key_by('variant')

Instead, do it just before exporting, and don’t change the key. Changing the key of a Hail table changes its order. You want the same ordering (locus and alleles) but with a colon-separated representation of the locus and alleles:

gwas.write('out_path/nap_chr'+chr+'_gwas.ht')
t = hl.read_table('out_path/nap_chr'+chr+'_gwas.ht')
t = t.key_by()
t = t.annotate(variant = hl.variant_str(t.locus, t.alleles))
t = t.select(
    'variant', 'eaf', 'n', 'beta', 'standard_error', 'p_value'
)
t.export('out_path/nap_chr'+chr+'_gwas.tsv.bgz', header=True, delimiter='\t')

Also, note that this produces one big file, which means that Hail has to sequentially concatenate all the partitioned results. This is slow. If you absolutely need one file, then there’s not much you can do. If you can instead work with a partitioned series of files, consider using the parallel argument to export; Hail, for example, can efficiently operate on a partitioned series of files.
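
For example, a sketch using the parallel argument ('separate_header' writes one header file plus one data shard per partition; 'header_per_shard' is the other option):

t.export('out_path/nap_chr'+chr+'_gwas.tsv.bgz',
         header=True,
         delimiter='\t',
         parallel='separate_header')  # write shards concurrently instead of concatenating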

Let me know if this improves performance enough. If not, I think the next thing to look at is the Hail log file. Its location is printed when you start Hail.

Thanks a lot, @danking! I have adopted your suggestions, and the total time for chr 21 has been reduced to 16 minutes, which is half of the original. However, I could not create keys for the mfi_info file; below is my code:

mfi_name = ['variant','rsid','bp','ref','alt','maf','ma','info']
mfi_info = pd.read_table(
    mfiFile,
    header=None,
    names=mfi_name,
    dtype='str'
)
mfi_info['locus'] = chr+':'+mfi_info.bp
# build an alleles column holding [alt, ref] lists
alleles = mfi_info.loc[:,['alt','ref']]
alleles = alleles.values.tolist()
mfi_info['alleles'] = alleles
mfi_info = mfi_info.loc[:,['locus','bp','alleles','info']]
mfi_info.to_csv('chr_'+chr+'mfi_info.csv', na_rep=0, sep='\t', index=False)
mfi_info = hl.import_table(
    'chr_'+chr+'mfi_info.csv',
    key=['locus','alleles'],
    types={
        'locus':hl.tlocus(reference_genome='GRCh37'),
        'alleles':hl.tarray(hl.tstr),
        'bp':hl.tstr, 'info':hl.tfloat64
    }
)
mfi_info.show()

Here is the error information:
Hail version: 0.2.36-ed011219dd93
Error summary: JsonParseException: Unexpected character (''' (code 39)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
at [Source: UNKNOWN; line: 1, column: 3]

I have tried many times and found that the error was caused by the array 'alleles'. Could you help me figure it out? Thanks a lot.

When you specify the type (using the types argument to import_table) of a column (in this case, alleles) as something other than an hl.tint, hl.tfloat, hl.tlocus, hl.tbool, or hl.tstr, Hail tries to parse that column as JSON. In JSON, an array of strings should look like ["foo", "bar", "baz"].

I suspect your alleles look like: A,T,G. To parse this you should instead use 'alleles': hl.tstr and then on the next line you can do:

mfi_info = mfi_info.key_by(locus = mfi_info.locus,
                           alleles = mfi_info.alleles.split(','))

Hail will need to scan the file to verify that it’s ordered correctly after the string-splitting (unfortunately), but that should be quite quick.
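
Alternatively, a minimal sketch of making the pandas step above emit valid JSON, so that import_table can parse alleles as hl.tarray(hl.tstr) directly; this assumes the alleles column holds Python lists, as in your code:

import json

# serialize each list as JSON (["A", "T"]) rather than Python's repr (['A', 'T'])
mfi_info['alleles'] = mfi_info['alleles'].apply(json.dumps)
mfi_info.to_csv('chr_'+chr+'mfi_info.csv', na_rep=0, sep='\t', index=False)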

Thanks a lot, @danking! The total time for running chr21 is about 12 minutes now.

Below is my current script:

import hail as hl
import pandas as pd
import os
import sys

chr = sys.argv[1]
pwd = '/test/imp_data/'
os.chdir(pwd)
data_path = '/data/raw/UKB-41476/'
bgenFile = data_path+'imp_data/ukb_imp_chr'+chr+'_v3.bgen'
sampleFile = data_path+'imp_data/ukb53562_imp_chr'+chr+'_v3_s487296.sample'
mfiFile = data_path+'imp_data/ukb_mfi_chr'+chr+'_v3.txt'

# initiate hail environment
hl.init(min_block_size=128)

# import phenotype info
sample_info = hl.import_table(
    pwd+'/target_pheno_1190.csv',
    key='f_eid',
    types={
        'f_eid':hl.tstr, 'female':hl.tbool, 'age':hl.tfloat64,
        'f_1190_0_0':hl.tfloat64, 'shift_worker':hl.tbool,
        'PC1':hl.tfloat64, 'PC2':hl.tfloat64,
        'PC3':hl.tfloat64, 'PC4':hl.tfloat64
    },
    delimiter=','
)

mfi_info = hl.import_table(
    pwd+'/mfi_info_chr'+chr+'.tsv',
    types={
        'locus':hl.tlocus(reference_genome='GRCh37'),
        'bp':hl.tstr, 'INFO':hl.tfloat64
    },
    delimiter='\t'
)

mfi_info = mfi_info.key_by(
    locus = mfi_info.locus,
    alleles = mfi_info.alleles.split(',')
)

# import genotype information
genotype_info = hl.import_bgen(
    path=bgenFile,
    entry_fields=['GT'],
    sample_file=sampleFile
)

# add mfi information to variants
data = genotype_info.annotate_rows(mfi=mfi_info[genotype_info.row_key])
data = data.filter_rows(data.mfi.INFO>0.8)

# variant qc to get allele frequency
data = hl.variant_qc(data)

data = data.filter_rows(
    (data.variant_qc.AF[1]>=0.001)
    & (data.variant_qc.p_value_hwe>1e-10)  # filter by HWE
)

# add phenotype info to MatrixTable columns
data = data.annotate_cols(pheno=sample_info[data.col.s])

# filter out NA samples
data = data.filter_cols(~hl.is_nan(data.pheno.f_1190_0_0))

# run association analysis using linear regression
gwas = hl.linear_regression_rows(
    y=data.pheno.f_1190_0_0,
    x=data.GT.n_alt_alleles(),
    covariates=[1.0,
        data.pheno.female,  # a binary variable
        data.pheno.shift_worker,  # a binary variable
        data.pheno.age,
        data.pheno.PC1,
        data.pheno.PC2,
        data.pheno.PC3,
        data.pheno.PC4
    ],
    pass_through=[data.variant_qc.AF, data.mfi.bp]
)

# add allele frequency and rsid to gwas result
gwas = gwas.annotate(chr=hl.str(chr))

gwas = gwas.select(
    'n', 'beta',
    'standard_error',
    'p_value', 'chr',
    bp=hl.str(gwas.bp),
    oa=gwas.alleles[0],
    ea=gwas.alleles[1],
    eaf=gwas.AF[1]
)

gwas.write('out_path/nap_chr'+chr+'_gwas.ht')
t = hl.read_table('out_path/nap_chr'+chr+'_gwas.ht')
t = t.key_by()
t = t.annotate(variant = hl.variant_str(t.locus, t.alleles))

t = t.select(
    'variant','chr','bp','oa',
    'ea','eaf','n', 'beta',
    'standard_error','p_value'
)

t.export('out_path/nap_chr'+chr+'_gwas.tsv.bgz', header=True, delimiter='\t')
hl.stop()

Do you think it can be further improved? The current running time for a full GWAS is about 18 hours, which is just about acceptable.

This looks very close to what I would do. I still urge you to consider regressing on dosage.

As for run-time, the main benefit of Hail is that it scales. This has a precise meaning in computer science: if you do not change the problem but double the resources available, the time should drop to half of the original. If Hail scaled perfectly, then if you want your GWAS to finish in ~9 hours, use twice as many cores; if you want it to finish in ~4.5 hours, use 4x as many cores.

Hail does not scale perfectly, but I suspect you can dramatically reduce run-time by increasing the number of cores 4-8x. In general, for UKB, I recommend setting the number of cores to ~1/3 the number of partitions. You can determine the number of partitions with hl.import_bgen(...).n_partitions().
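
For example:

mt = hl.import_bgen(path=bgenFile, entry_fields=['GT'], sample_file=sampleFile)
print(mt.n_partitions())  # for UKB, aim for roughly n_partitions / 3 cores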

Under perfect scaling, increasing the number of cores will not change the cost of your analysis, because Google charges you primarily in compute-hours. Compute-hours are the product n_cores * time. If you double cores but halve time, you haven’t changed the cost. You can read more about the theory and practice of scaling with Hail in my hail cloud docs.
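
To make that concrete with the numbers from this thread: 52 cores for 18 hours is 52 × 18 = 936 core-hours; under perfect scaling, 208 cores (4x) would finish the same 936 core-hours of work in 936 / 208 = 4.5 hours at the same total cost.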


Sounds great, @danking! However, the running time seemed to increase to 20 minutes for chr 21 when regressing on dosage. Additionally, I use Hail to perform GWAS on our local server, not on Google Cloud.

Hi Dan,
I would say this is a nice case study in optimization, and in the caveats of what not to do.