GWAS on subset of UKBioBank

Thanks for the info! I’m not worried too much about compute time, I just want to make sure I’m preparing in the most efficient way possible. I’ll be working on the Biowulf cluster at the NIH.

EDIT: Once I know the most efficient way to use Hail, then I’ll benchmark the jobs and I can work with the cluster support staff to efficiently request resources.

I’m having an issue in my process of subsetting and merging the genotype and phenotype data.

# phenotype
# 11533 x 79800 csv
pheno_mt = hl.import_matrix_table(idpsFile, no_header=True, entry_type=hl.tfloat, delimiter=',')
subjs = [...]  # list of 11533 strings of subject IDs (samples)

# genotype
genotype_info = hl.import_bgen(bgen22Files, entry_fields=['dosage'], sample_file=sampleFile)
genotype_info_subset = genotype_info.filter_cols(hl.literal(subjs).contains(genotype_info.s))

When I subset genotype_info based on subjs, only 11224 samples are returned. I take it this means I don’t have genotype data for some of the subjects I have phenotype data for. That’s fine.
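A quick way to confirm which phenotyped subjects are missing genotypes is a set difference between `subjs` and the sample IDs that survived the filter (in Hail the latter would come from `genotype_info_subset.s.collect()`; the IDs below are made-up placeholders for illustration):

```python
# Hypothetical IDs for illustration; in practice you would use:
#   genotyped = genotype_info_subset.s.collect()
subjs = ['1000001', '1000002', '1000003', '1000004']  # phenotyped subjects
genotyped = ['1000001', '1000003', '1000004']         # samples kept by filter_cols

# Subjects with phenotype data but no genotype data
missing = sorted(set(subjs) - set(genotyped))
print(missing)
```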

However, I want to merge these by sample / s / subjs. Since subjs and pheno_mt are separate, I’m not sure how to perform this merge. Ideally I’d have the row key for pheno_mt be subjs.

I’ve tried to use genotype_info_subset.s to subset subjs, but s is a Hail StringExpression rather than a Python value, so it can’t index a Python list:

TypeError: list indices must be integers or slices, not StringExpression

I’ve tried to annotate the columns of pheno_mt with subjs, but this puts the entire subject list at each row. And I’ve tried to replace the key of pheno_mt with pheno_mt.key_cols_by(subjs), which fails because key_cols_by expects field names or expressions rather than a Python list:

TypeError: key_cols_by: parameter '*keys' (arg 0 of 1): expected (str or hail.expr.expressions.base_expression.Expression), found list:
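One way to get a per-subject key onto pheno_mt is to index the rows and map each index to a subject ID. This is an untested sketch: it assumes row i of the CSV corresponds to subjs[i], that `x` is the entry field name (the default from import_matrix_table), and it needs a Hail/Spark runtime to actually run.

```python
import hail as hl

# Give each row an index, map it to a subject ID, and re-key the rows.
# ASSUMPTION: row i of the phenotype CSV corresponds to subjs[i].
pheno_mt = pheno_mt.add_row_index()
pheno_mt = pheno_mt.annotate_rows(
    s=hl.literal(subjs)[hl.int32(pheno_mt.row_idx)])
pheno_mt = pheno_mt.key_rows_by('s')

# Collect each subject's phenotypes into one array, then join that
# onto the genotype MT's columns (samples) by sample ID.
pheno_rows = pheno_mt.annotate_rows(
    phenos=hl.agg.collect(pheno_mt.x)).rows()
geno = genotype_info_subset.annotate_cols(
    phenos=pheno_rows[genotype_info_subset.s].phenos)
```

After this join, each sample column of the genotype MT carries a `phenos` array that linear_regression_rows can index into.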

@danking since I want to run 79,800 separate GWASes, distributed across our cluster, which approach do you think is more efficient?

  1. Merge the geno matrix table with the full pheno table, then write a script to run a GWAS on a single annotation (e.g. a single column from the pheno table).

  2. Write a script that subsets the pheno table for each iteration before merging pheno and geno, then run the GWAS.

Either way I’ll need a way to preserve the pheno column index, so I’ll likely write a script that takes a column index as an argument.

I think you want option 3 –

  • Divide phenotypes into groups of BATCH_SIZE (try 16 or 32 to start).
  • For each batch, read the pheno table, select the phenos in the current batch, and annotate_cols to join them onto the geno MT.
  • Run linear_regression_rows on the whole batch of phenotypes at once, which can take advantage of BLAS3 routines that save a lot of time over multiple calls to BLAS2 routines.
  • Export the results for this batch to a file.
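The steps above can be sketched as a loop over index batches. The batching arithmetic below is plain Python and runnable; the Hail calls are shown as comments because they need a Spark runtime, and `geno_mt.phenos` / `geno_mt.dosage` are assumed field names, not something Hail creates for you.

```python
BATCH_SIZE = 32
N_PHENOS = 79_800

def batches(n_items, batch_size):
    """Yield lists of phenotype column indices, batch_size at a time."""
    for start in range(0, n_items, batch_size):
        yield list(range(start, min(start + batch_size, n_items)))

for batch in batches(N_PHENOS, BATCH_SIZE):
    # Hypothetical Hail calls (assumes a col field `phenos` holding each
    # sample's phenotype array, and a `dosage` entry field from import_bgen):
    # results = hl.linear_regression_rows(
    #     y=[geno_mt.phenos[i] for i in batch],
    #     x=geno_mt.dosage,
    #     covariates=[1.0])            # intercept
    # results.export(f'gwas_batch_{batch[0]}.tsv.bgz')
    pass
```

Preserving the pheno column index is automatic here: batch i covers indices `batch[0]` through `batch[-1]`, which you can carry into the output filename.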

Ah good call, thanks.
You suggested this earlier in the thread, sorry to make you repeat yourself!

@tpoterba your method seems to be working well, though saving the results takes a really really long time. Is that to be expected?

batch_results = hl.linear_regression_rows(
    y=[geno_subset_mt.phenos[i] for i in range(batch_size)],
    x=geno_subset_mt.dosage,   # regress each phenotype on dosage
    covariates=[1.0])          # intercept term

batch_results.write('test') # takes a really long time

Yes, absolutely! Hail’s interfaces are lazy; no computational work is happening until your write line. The time for the write includes the time for the regressions (which is the majority of the work).
