Optimizing partitions and workers for UKBB analysis

Mark · August 29, 2019, 5:55pm

I’m trying to run a GWAS (logistic regression) for about 369,000 subjects using UKBB data on Google Cloud. After filtering out SNPs based on info score and minor allele frequency, I think I’ll have about 1 million SNPs per chromosome.

When I create my cluster, how should I determine the most cost-effective number of preemptible workers? And when I use the import_bgen function to import the BGEN file, how many partitions would be optimal?

Any tips would be greatly appreciated. Thanks!

tpoterba · August 29, 2019, 7:12pm

Probably best to leave the partitions as default. Don’t write to a matrix table – just work off the BGEN.

I would recommend building your SNP inclusion list, writing that as a table, and then using that to filter the BGEN each time you use it (rather than writing an intermediate matrix table, or computing info score every time).
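A minimal sketch of that workflow, assuming the inclusion list is computed once from the BGEN itself. The helper name build_variant_list, the output path, and the thresholds 0.8 and 0.01 are hypothetical placeholders and do not come from this thread; the BGEN is assumed to have been indexed with hl.index_bgen already.

```python
def build_variant_list(bgen_path, sample_path, out_path,
                       min_info=0.8, min_maf=0.01):
    """Write a Hail Table of variants passing info-score and MAF filters.

    Thresholds are illustrative defaults, not recommendations.
    """
    # Imported here so this sketch can be loaded without Hail installed;
    # in a real script the import normally sits at the top of the file.
    import hail as hl

    # One-time pass over the BGEN: GP feeds the info score,
    # dosage feeds the alternate allele frequency.
    mt = hl.import_bgen(bgen_path, entry_fields=['GP', 'dosage'],
                        sample_file=sample_path)
    mt = mt.annotate_rows(
        info=hl.agg.info_score(mt.GP).score,
        alt_af=hl.agg.mean(mt.dosage) / 2,
    )
    maf = hl.min(mt.alt_af, 1 - mt.alt_af)
    mt = mt.filter_rows((mt.info >= min_info) & (maf >= min_maf))

    # Keep only the key fields (locus, alleles) and persist them.
    mt.rows().select().write(out_path, overwrite=True)
    return out_path
```

Later runs would then read the saved table with hl.read_table and pass it as the variants argument of hl.import_bgen, so the info score is never recomputed.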

I think you should be fine with around 500 worker cores (50-70 workers of default size).
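The worker range follows from simple division. A rough sketch, assuming 8 cores per default-size worker (an assumption that is consistent with the 50-70 range quoted here; check your own cluster's machine type). The helper name workers_for_cores is hypothetical.

```python
import math


def workers_for_cores(target_cores: int, cores_per_worker: int = 8) -> int:
    """Return the smallest worker count that supplies at least target_cores."""
    if target_cores <= 0 or cores_per_worker <= 0:
        raise ValueError("core counts must be positive")
    return math.ceil(target_cores / cores_per_worker)


# 500 cores at 8 cores per worker (assumed) rounds up to 63 workers.
print(workers_for_cores(500))
```

With a different machine type, pass its core count as cores_per_worker to get the matching worker count.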

Mark · August 29, 2019, 10:30pm

Thank you! That’s very helpful. Below is a simplified version of the code I’ve written starting with a table of variants called “variants”. Does it seem reasonable?

import hail as hl

# load the SNPs from the .bgen file (it must already be indexed with hl.index_bgen)
mt = hl.import_bgen("data.bgen", entry_fields=['GP'], sample_file="sample", variants=variants)

# load the phenotypes
phenos = hl.import_table("phenotypes.txt", impute=True, key='id')

# add phenotype columns to MatrixTable mt
mt = mt.annotate_cols(pheno = phenos[mt.s])

# filter out people without a phenotype
mt = mt.filter_cols(mt.pheno.phenotype > 0)

# GWAS (recessive analysis)
gwas = hl.logistic_regression_rows(y=mt.pheno.phenotype, x=mt.GP[2], test='wald', covariates=[1.0, mt.pheno.age, mt.pheno.sex])

# export
gwas.export("gwas_results.tsv.bgz")

tpoterba · August 29, 2019, 11:16pm

Yes, this seems pretty optimal. It’ll still be somewhat expensive – logistic regression requires a lot of compute.
