The rough strategy that will probably work best here is:
- prepare a matrix table with genotypes, and a matrix table with phenotypes (orientation of this second matrix doesn’t really matter)
- Using some batch size B, annotate B phenotypes from the phenotypes matrix onto the genotype matrix columns, then run linear_regression_rows against all B phenotypes together. There are optimizations inside of Hail’s linear_regression_rows when running on multiple phenotypes at the same time – we can use BLAS3 operations instead of BLAS2, which is much more efficient. Finally, write the linear regression results from this batch to a file.
- Repeat #2 for all batches.
Some specifics:
preparing phenotype matrix
You can absolutely use Hail for this. Something like:
pheno_mt = hl.import_matrix_table(idpsFile, no_header=True)
pheno_mt.write('some_path.mt')
Running a batch will look something like:
batch_size = ...
batch_index = ...
phenos_mt = hl.read_matrix_table('some_path.mt')
pheno_start = batch_size * batch_index
pheno_end = pheno_start + batch_size
# rows are samples, cols are phenotypes, right?
phenos_mt = phenos_mt.filter_cols((phenos_mt.col_id >= pheno_start) & (phenos_mt.col_id < pheno_end))
phenos_table = phenos_mt.annotate_rows(phenos = hl.agg.collect(phenos_mt.x)).rows()
genos_mt = hl.read_matrix_table('some_other_path.mt')
genos_mt = genos_mt.add_col_idx() # we are joining by index, right?
genos_mt = genos_mt.annotate_cols(phenos = phenos_table[genos_mt.col_idx].phenos)
batch_results = hl.linear_regression_rows(
[genos_mt.phenos[i] for i in range(batch_size)],
genos_mt.GT.n_alt_alleles(),
...covariates...,
block_size=128, #you can use more than the default 16 because you only have 11k samples)
batch_results.write(...)
a caveat
This approach relies on the phenotypes being non-missing. If phenotypes are sometimes missing, you’ll need a slightly different strategy – the batched linear regression runs regression within a group on the set of samples that are non-missing for every phenotype and every covariate. If you have phenotypes with disparate missingness patterns, you can end up with no samples in that set!