Improve writing time for GWAS results

Liverpool · November 18, 2020, 5:06pm

Hi, I’m running a linear regression GWAS with about 100K samples and 90K variants, and then exporting the results. The whole process took about 2 hours, which seems too long. My code is below; how can I improve it? Thanks a lot!

import hail as hl
hl.init(min_block_size=128)  

### prepare the inputs:
chr='22'  
Input='phenotype.txt'
y = 'Diabetes'  # name of the phenotype in the text file

### import the covariates and phenotype info:
sample_info = hl.import_table(paths=Input, types={'IID':hl.tstr, y:hl.tint32}, impute=True).key_by('IID')
sample_info = sample_info.rename({y: 'phenotype'})  # rename the regression Y to be "phenotype"

### Import PLINK bfiles: 
mt = hl.import_plink(bed='genotype/chr'+chr+'.bed',
                     bim='genotype/chr'+chr+'.bim',
                     fam='genotype/chr'+chr+'.fam',
                     quant_pheno=True, missing='-9', 
                     reference_genome='GRCh37')

### add phenotypes to genotype matrix and keep only GWAS sample
mt = mt.annotate_cols(pheno = sample_info[mt.s])  # add phenotype and covariate info
mt = mt.filter_cols(hl.is_defined(mt.pheno.phenotype))  # drop samples with a missing phenotype

### variant QC
mt = hl.variant_qc(mt)  
mt = mt.filter_rows((mt.variant_qc.AF[1] >= 0.01)&
                    (mt.variant_qc.p_value_hwe >= 1e-6)&
                    (mt.variant_qc.call_rate >= 0.99))       

### run linear regressions
gwas = mt.annotate_rows(linreg = hl.agg.linreg(y = mt.pheno.phenotype, 
                                                x = [1, mt.GT.n_alt_alleles()])) 
gwas = gwas.rows().key_by()  
gwas = gwas.select(CHR = gwas.locus.contig, SNP = gwas.rsid, BP = gwas.locus.position, 
                A1 = gwas.alleles[1], A2 = gwas.alleles[0], EAF = gwas.variant_qc.AF[1],
                BETA = gwas.linreg.beta[1], SE = gwas.linreg.standard_error[1], 
                P = gwas.linreg.p_value[1], N = gwas.linreg.n)

### export result
# first write to Hail’s efficient and fast on-disk format, then read back in and convert to a text file
gwas.write('chr'+chr+'linear.ht') 
hl.read_table('chr'+chr+'linear.ht').export('chr'+chr+'linear.tsv.bgz', header=True)

tpoterba · November 20, 2020, 9:17pm

One possible code change is to use the linear_regression_rows method instead of hl.agg.linreg – it is purpose-built for efficient GWAS execution, and we’re still working on improving the performance of the linreg aggregator.

Liverpool · November 20, 2020, 9:32pm

thanks!
