Optimizing partitions and workers for UKBB analysis

I’m trying to run a GWAS (logistic regression) for about 369,000 subjects using UKBB data on Google Cloud. After filtering out SNPs based on info score and minor allele frequency, I think I’ll have about 1 million SNPs per chromosome.

When I create my cluster, how should I determine the most cost-effective number of preemptible workers? And when I use the import_bgen function to import the BGEN file, how many partitions would be optimal?

Any tips would be greatly appreciated. Thanks!

Probably best to leave the partitions as default. Don’t write to a matrix table – just work off the BGEN.

I would recommend building your SNP inclusion list, writing that as a table, and then using that to filter the BGEN each time you use it (rather than writing an intermediate matrix table, or computing info score every time).
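A rough sketch of that workflow, in case it helps. The function names, paths, and the 0.8 / 0.01 thresholds are made up for illustration, and it assumes the BGEN has already been indexed with hl.index_bgen:

```python
def build_variant_list(bgen_path, sample_path, out_path,
                       min_info=0.8, min_maf=0.01):
    """Compute info score and MAF once and save the passing variants.

    The thresholds are placeholder values, not recommendations.
    """
    import hail as hl  # imported here so the file loads even where Hail is absent

    mt = hl.import_bgen(bgen_path, entry_fields=['GP', 'dosage'],
                        sample_file=sample_path)
    mt = mt.annotate_rows(
        info=hl.agg.info_score(mt.GP).score,
        alt_af=hl.agg.mean(mt.dosage) / 2,  # dosage counts alt alleles (0 to 2)
    )
    maf = hl.min(mt.alt_af, 1 - mt.alt_af)
    mt = mt.filter_rows((mt.info >= min_info) & (maf >= min_maf))
    # Keep only the row key (locus, alleles); that is all import_bgen needs.
    mt.rows().select().write(out_path, overwrite=True)


def import_filtered_bgen(bgen_path, sample_path, variants_path):
    """Re-import the BGEN restricted to the saved variant list."""
    import hail as hl

    variants = hl.read_table(variants_path)
    return hl.import_bgen(bgen_path, entry_fields=['GP'],
                          sample_file=sample_path, variants=variants)
```

The first function is the expensive one and only needs to run once per chromosome; every later analysis can call the second and skip the info score computation.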

I think you should be fine with around 500 worker cores (50-70 workers of default size).
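The arithmetic behind that range, as a quick sketch. It assumes the default worker has 8 cores, which you should check against the machine type your cluster actually uses; the helper names are just for illustration:

```python
import math


def workers_needed(target_cores, cores_per_worker=8):
    """Smallest number of workers whose combined cores reach target_cores."""
    # cores_per_worker=8 is an assumption about the default machine type.
    return math.ceil(target_cores / cores_per_worker)


def total_cores(n_workers, cores_per_worker=8):
    """Total cores provided by n_workers identical workers."""
    return n_workers * cores_per_worker


print(workers_needed(500))                # 63
print(total_cores(50), total_cores(70))   # 400 560
```

So 50-70 workers at 8 cores each brackets the 500-core target.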

Thank you! That’s very helpful. Below is a simplified version of the code I’ve written starting with a table of variants called “variants”. Does it seem reasonable?

import hail as hl

# load the SNPs from the .bgen file
mt = hl.import_bgen("data.bgen", entry_fields=['GP'], sample_file="sample", variants=variants)

# load the phenotypes
phenos = hl.import_table("phenotypes.txt", impute=True, key='id')

# add phenotype columns to MatrixTable mt
mt = mt.annotate_cols(pheno = phenos[mt.s])

# filter out people without a phenotype
mt = mt.filter_cols(mt.pheno.phenotype > 0)

# GWAS (recessive analysis: GP[2] is the probability of the homozygous-alternate genotype)
gwas = hl.logistic_regression_rows(test='wald', y=mt.pheno.phenotype, x=mt.GP[2], covariates=[1.0, mt.pheno.age, mt.pheno.sex])

# export

Yes, this seems pretty optimal. It’ll still be somewhat expensive – logistic regression requires a lot of compute.