GWAS on a subset of the UK Biobank

Thanks for the info! I’m not too worried about compute time; I just want to make sure I’m preparing in the most efficient way possible. I’ll be working on the Biowulf cluster at the NIH.

EDIT: Once I know the most efficient way to use Hail, I’ll benchmark the jobs and work with the cluster support staff to request resources efficiently.

I’m having an issue in my process of subsetting and merging the genotype and phenotype data.

import hail as hl

# phenotype: an 11533 x 79800 CSV (rows are samples, columns are phenotypes)
pheno_mt = hl.import_matrix_table(idpsFile, no_header=True, entry_type=hl.tfloat, delimiter=',')
subjs  # a list of 11533 subject-ID strings (samples)

# genotype
genotype_info = hl.import_bgen(bgen22Files, entry_fields=['dosage'], sample_file=sampleFile)
genotype_info_subset = genotype_info.filter_cols(hl.literal(subjs).contains(genotype_info.s))

When I subset genotype_info based on subjs, only 11224 samples are returned. I take it this means I don’t have genotype data for some of the subjects I have phenotype data for. That’s fine.
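
For reference, one way to list exactly which subjects are missing genotype data might be something like this (a minimal sketch, using the unfiltered genotype_info from above):

# collect the genotyped sample IDs, then diff against the phenotype subject list
genotyped = set(genotype_info.s.collect())
missing = [s for s in subjs if s not in genotyped]
print(len(missing))  # expected: 11533 - 11224 = 309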

However, I want to merge these by sample / s / subjs. Since subjs and pheno_mt are separate objects, I’m not sure how to perform this merge. Ideally, the row key for pheno_mt would be subjs.

I’ve tried to use genotype_info_subset.s to subset subjs:

subjs[genotype_info_subset.s]
TypeError: list indices must be integers or slices, not StringExpression

I’ve tried to annotate the columns of pheno_mt with subjs, but this puts the entire subject list at each row. And I’ve tried to replace the key of pheno_mt with pheno_mt.key_cols_by(subjs):

TypeError: key_cols_by: parameter '*keys' (arg 0 of 1): expected (str or hail.expr.expressions.base_expression.Expression), found list:
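
The closest I’ve gotten is to attach the IDs to pheno_mt by position and re-key (a sketch; it assumes the i-th row of the phenotype CSV corresponds to subjs[i], and relies on the default row_id field that import_matrix_table adds), but I’m not sure it’s idiomatic:

# attach each subject ID to its row by position, then make it the row key
subjs_lit = hl.literal(subjs)
pheno_mt = pheno_mt.annotate_rows(s=subjs_lit[hl.int32(pheno_mt.row_id)])
pheno_mt = pheno_mt.key_rows_by('s')
# pheno_mt rows and the genotype MT columns now share the sample-ID key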

@danking: since I want to run 79,800 separate GWASes distributed across our cluster, which approach do you think is more efficient?

  1. Merge the geno matrix table with the full pheno table, then write a script to run a GWAS on a single annotation (e.g., a single column from the pheno table),

or

  2. Write a script that subsets the pheno table for each iteration before merging pheno and geno, then run the GWAS.

Either way, I’ll need to preserve the pheno column index, so I’ll likely write a script that takes a column index as an argument.

I think you want option 3; see the sketch after this list:

  • Divide phenotypes into groups of BATCH_SIZE (try 16 or 32 to start).
  • For each batch, read the pheno table, select the phenos in the current batch, and annotate_cols to join onto the geno MT.
  • Run linear_regression_rows on the batch of phenotypes at the same time, which can take advantage of BLAS3 routines that save a lot of time over multiple calls to BLAS2 routines.
  • Export the results for this batch to a file.
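
Roughly, the loop might look like this (hypothetical names: pheno_ht is your phenotype Table keyed by sample ID with an array field phenos, and geno_mt is the MatrixTable from import_bgen; adapt to your actual schema):

import hail as hl

BATCH_SIZE = 16
N_PHENOS = 79_800

for start in range(0, N_PHENOS, BATCH_SIZE):
    stop = min(start + BATCH_SIZE, N_PHENOS)
    # join just this batch of phenotypes onto the genotype MT by sample ID
    mt = geno_mt.annotate_cols(phenos=pheno_ht[geno_mt.s].phenos[start:stop])
    # regressing a list of phenotypes in one call triggers the
    # multi-phenotype (BLAS3) code path
    results = hl.linear_regression_rows(
        [mt.phenos[i] for i in range(stop - start)],
        mt.dosage,
        covariates=[1.0],
        block_size=128)
    results.export(f'gwas_batch_{start:05d}.tsv.bgz')

Exporting each batch separately keeps the outputs small and lets the batches run as independent jobs on the cluster.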

Ah good call, thanks.
You suggested this earlier in the thread, sorry to make you repeat yourself!

@tpoterba, your method seems to be working well, though saving the results takes a really, really long time. Is that to be expected?

batch_results = hl.linear_regression_rows(
    [geno_subset_mt.phenos[i] for i in range(batch_size)],
    geno_subset_mt.dosage,
    covariates=[1.0],
    block_size=128)

batch_results.write('test') # takes a really long time

Yes, absolutely! Hail’s interfaces are lazy; no computational work happens until your write line. The time for the write includes the time for the regressions (which is the majority of the work).
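
A toy example of the same laziness, unrelated to your pipeline:

import hail as hl

ht = hl.utils.range_table(1_000_000)
ht = ht.annotate(y=ht.idx * 2)  # returns immediately: just builds an expression tree
ht.write('demo.ht')             # the actual computation happens here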
