GWAS on subset of UKBioBank

tpoterba · March 23, 2021, 7:32pm

The rough strategy that will probably work best here is:

prepare a matrix table with genotypes, and a matrix table with phenotypes (orientation of this second matrix doesn’t really matter)
Using some batch size B, annotate B phenotypes from the phenotypes matrix onto the genotype matrix columns, then run linear_regression_rows against all B phenotypes together. There are optimizations inside of Hail’s linear_regression_rows when running on multiple phenotypes at the same time – we can use BLAS3 operations instead of BLAS2, which is much more efficient. Finally, write the linear regression results from this batch to a file.
Repeat #2 for all batches.

Some specifics:

preparing phenotype matrix

You can absolutely use Hail for this. Something like:

pheno_mt = hl.import_matrix_table(idpsFile, no_header=True)
pheno_mt.write('some_path.mt')

Running a batch will look something like:

batch_size = ...
batch_index = ...

phenos_mt = hl.read_matrix_table('some_path.mt')

pheno_start = batch_size * batch_index
pheno_end = pheno_start + batch_size

# rows are samples, cols are  phenotypes, right?
phenos_mt = phenos_mt.filter_cols((phenos_mt.col_id >= pheno_start) & (phenos_mt.col_id < pheno_end))
phenos_table = phenos_mt.annotate_rows(phenos = hl.agg.collect(phenos_mt.x)).rows()

genos_mt = hl.read_matrix_table('some_other_path.mt')
genos_mt = genos_mt.add_col_idx() # we are joining by index, right?
genos_mt = genos_mt.annotate_cols(phenos = phenos_table[genos_mt.col_idx].phenos)

batch_results = hl.linear_regression_rows(
    [genos_mt.phenos[i] for i in range(batch_size)],
    genos_mt.GT.n_alt_alleles(),
    ...covariates...,
    block_size=128, #you can use more than the default 16 because you only have 11k samples)
batch_results.write(...)

a caveat

This approach relies on the phenotypes being non-missing. If phenotypes are sometimes missing, you’ll need a slightly different strategy – the batched linear regression runs regression within a group on the set of samples that are non-missing for every phenotype and every covariate. If you have phenotypes with disparate missingness patterns, you can end up with no samples in that set!

Topic		Replies	Views
How to run GWAS from UK Biobank efficiently on Hail Hail Query & hailctl	11	3318	December 21, 2020
GWAS results with 0 rows Hail Query & hailctl	16	570	August 23, 2021
Annotate column with a list, then merge two matrix tables Hail Query & hailctl	13	1154	June 8, 2021
Best practices for UK Biobank Imputed Data Hail Query & hailctl	9	1378	March 19, 2022
PheWAS on DNAnexus UKB RAP Hail Query & hailctl	13	853	December 21, 2022

GWAS on subset of UKBioBank

preparing phenotype matrix

a caveat

Related topics