I’m trying to run a GWAS (logistic regression) for about 369,000 subjects using UKBB data on Google Cloud. After filtering out SNPs based on info score and minor allele frequency, I think I’ll have about 1 million SNPs per chromosome.
When I create my cluster, how should I determine the most cost-effective number of preemptible workers? And when I use the import_bgen function to import the BGEN file, how many partitions would be optimal?
Probably best to leave the partitions at the default. Don’t write to a matrix table; just work off the BGEN.
I would recommend building your SNP inclusion list, writing that out as a table, and then using that table to filter the BGEN each time you use it (rather than writing an intermediate matrix table or recomputing the info score on every run).
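For example, here is a rough sketch of that one-time step (the gs:// paths and the 0.8 info-score / 0.001 MAF cutoffs are placeholders for your own, and you could equally derive the list from the UKBB MFI files instead of aggregating over the BGEN):

import hail as hl
# one-time pass over the full BGEN to compute per-variant info score and alt allele frequency
mt = hl.import_bgen("gs://my-bucket/data.bgen", entry_fields=['GP'], sample_file="gs://my-bucket/data.sample")
mt = mt.annotate_rows(info=hl.agg.info_score(mt.GP).score,
                      alt_af=hl.agg.mean(mt.GP[1] + 2 * mt.GP[2]) / 2)  # expected alt dosage / 2
# keep variants passing the thresholds and write out just the keys (locus, alleles)
variants = mt.rows()
variants = variants.filter((variants.info > 0.8) & (hl.min(variants.alt_af, 1 - variants.alt_af) > 0.001))
variants.select().write("gs://my-bucket/variant_inclusion_list.ht")

Every downstream script then reads that table back with hl.read_table and passes it through the variants argument of import_bgen.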
I think you should be fine with around 500 worker cores (50-70 workers of the default size).
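When you start the cluster with hailctl, something along these lines should land you in that range (flag names vary between hailctl versions, and newer versions call preemptibles “secondary workers”, so check hailctl dataproc start --help):

hailctl dataproc start gwas-cluster --num-workers 2 --num-preemptible-workers 60

With the default n1-standard-8 workers that is roughly 480 preemptible cores on top of the two non-preemptible workers Dataproc requires.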
Thank you! That’s very helpful. Below is a simplified version of the code I’ve written, starting with a table of variants called “variants”. Does it seem reasonable?
import hail as hl

# load the SNPs in the "variants" table from the .bgen file
# (the BGEN must have been indexed with hl.index_bgen before it can be imported)
mt = hl.import_bgen("data.bgen", entry_fields=['GP'], sample_file="sample", variants=variants)
# load the phenotypes (the 'id' key needs the same type as mt.s, which is a string)
phenos = hl.import_table("phenotypes.txt", impute=True, key='id')
# join the phenotype columns onto the MatrixTable by sample ID
mt = mt.annotate_cols(pheno=phenos[mt.s])
# keep only people with a phenotype (samples with a missing value, or one coded 0 or negative, are dropped)
mt = mt.filter_cols(mt.pheno.phenotype > 0)
# GWAS under a recessive coding: x is the probability of the homozygous-alt genotype, GP[2]
# (logistic_regression_rows expects y coded 0/1 or boolean, and the intercept 1.0 listed explicitly)
gwas = hl.logistic_regression_rows(y=mt.pheno.phenotype, x=mt.GP[2], test='wald',
                                   covariates=[1.0, mt.pheno.age, mt.pheno.sex])
# export
gwas.export("gwas_results.tsv.bgz")