GWAS on subset of UKBioBank

The rough strategy that will probably work best here is:

  1. prepare a matrix table with genotypes, and a matrix table with phenotypes (orientation of this second matrix doesn’t really matter)
  2. Using some batch size B, annotate B phenotypes from the phenotypes matrix onto the genotype matrix columns, then run linear_regression_rows against all B phenotypes together. There are optimizations inside of Hail’s linear_regression_rows when running on multiple phenotypes at the same time – we can use BLAS3 operations instead of BLAS2, which is much more efficient. Finally, write the linear regression results from this batch to a file.
  3. Repeat #2 for all batches.

Some specifics:

preparing phenotype matrix

You can absolutely use Hail for this. Something like:

pheno_mt = hl.import_matrix_table(idpsFile, no_header=True)
pheno_mt.write('some_path.mt')

Running a batch will look something like:

batch_size = ...
batch_index = ...

phenos_mt = hl.read_matrix_table('some_path.mt')

pheno_start = batch_size * batch_index
pheno_end = pheno_start + batch_size

# rows are samples, cols are  phenotypes, right?
phenos_mt = phenos_mt.filter_cols((phenos_mt.col_id >= pheno_start) & (phenos_mt.col_id < pheno_end))
phenos_table = phenos_mt.annotate_rows(phenos = hl.agg.collect(phenos_mt.x)).rows()

genos_mt = hl.read_matrix_table('some_other_path.mt')
genos_mt = genos_mt.add_col_idx() # we are joining by index, right?
genos_mt = genos_mt.annotate_cols(phenos = phenos_table[genos_mt.col_idx].phenos)

batch_results = hl.linear_regression_rows(
    [genos_mt.phenos[i] for i in range(batch_size)],
    genos_mt.GT.n_alt_alleles(),
    ...covariates...,
    block_size=128, #you can use more than the default 16 because you only have 11k samples)
batch_results.write(...)

a caveat

This approach relies on the phenotypes being non-missing. If phenotypes are sometimes missing, you’ll need a slightly different strategy – the batched linear regression runs regression within a group on the set of samples that are non-missing for every phenotype and every covariate. If you have phenotypes with disparate missingness patterns, you can end up with no samples in that set!