Most efficient way to analyse a large dataset split by chromosome (like UK Biobank) in Hail 0.2?

hhx037 · October 16, 2018, 4:33pm

Sorry, me again

I’m wondering what the most efficient way is to analyse (logreg) a dataset such as the UK Biobank, which is split by chromosome. Is a simple loop over the chromosomes the way to go, or is there a more efficient method? How did you proceed? (My resources are far more limited, with only 46 cores available.)

Any tips that can speed up the regression, saving the results, or exporting p-values? (gwas.rows().cache()?)

tpoterba · October 16, 2018, 5:16pm

You can load many BGENs at once with import_bgen – this is probably the easiest thing.

You also won’t need the cache there, since this is a one-pass algorithm.
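A minimal sketch of the advice above, not a tested pipeline: passing a list of per-chromosome paths to `hl.import_bgen` gives a single MatrixTable, and the regression result is exported directly with no `cache()`. The file names, the phenotype table, and the fields `is_case` and `age` are hypothetical placeholders. The sketch also assumes that a single sample file applies to every chromosome file.

```python
from typing import Iterable, List


def chromosome_paths(template: str, chromosomes: Iterable = range(1, 23)) -> List[str]:
    """Expand a template containing '{chrom}' into one path per chromosome."""
    return [template.format(chrom=c) for c in chromosomes]


def run_logreg(bgen_template: str, sample_file: str, pheno_path: str, out_path: str) -> None:
    # Hail is imported lazily so the path helper works without it installed.
    import hail as hl

    paths = chromosome_paths(bgen_template)

    # BGEN files need an index before import_bgen can read them (one-time step).
    # Depending on the contig names in the files, contig_recoding may be needed.
    hl.index_bgen(paths)

    # A list of paths yields one MatrixTable covering all chromosomes.
    mt = hl.import_bgen(paths, entry_fields=['dosage'], sample_file=sample_file)

    # Hypothetical phenotype table keyed by sample ID 's' (kept as a string).
    pheno = hl.import_table(pheno_path, key='s', impute=True, types={'s': hl.tstr})
    mt = mt.annotate_cols(pheno=pheno[mt.s])

    # One pass over the data; the intercept (1.0) must be listed explicitly.
    results = hl.logistic_regression_rows(
        test='wald',
        y=mt.pheno.is_case,            # hypothetical boolean phenotype field
        x=mt.dosage,
        covariates=[1.0, mt.pheno.age],  # hypothetical covariate field
    )

    # Write the results straight to disk; no cache() in between.
    results.export(out_path)
```

For example, `run_logreg('ukb_chr{chrom}.bgen', 'ukb.sample', 'pheno.tsv', 'logreg.tsv.bgz')` would cover chromosomes 1 to 22 in one job rather than one job per chromosome.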

hhx037 · October 16, 2018, 7:32pm

Great, thank you. Given the size of the data and the relatively limited resources I’ve got, I want to optimize as much as I can.

Would the multiple import also work for VCF files?

tpoterba · October 16, 2018, 7:58pm

Yes, it does!
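As a rough illustration of the VCF case, `hl.import_vcf` also accepts a list of paths (or a glob pattern) and returns one MatrixTable. The file naming below is a hypothetical placeholder, and `force_bgz=True` assumes the `.vcf.gz` files are block-gzipped. The files should share a compatible header.

```python
from typing import Iterable, List


def vcf_paths(template: str, chromosomes: Iterable) -> List[str]:
    """Expand a template containing '{chrom}' into one VCF path per chromosome."""
    return [template.format(chrom=c) for c in chromosomes]


def load_vcfs(template: str, chromosomes: Iterable = tuple(range(1, 23))):
    # Hail is imported lazily so the path helper works without it installed.
    import hail as hl

    paths = vcf_paths(template, chromosomes)

    # A list of paths is read as a single MatrixTable.
    # A glob such as 'data/chr*.vcf.gz' could be passed instead of a list.
    return hl.import_vcf(paths, force_bgz=True)
```

Calling `load_vcfs('data/chr{chrom}.vcf.gz')` would then give one dataset that can go through the same regression as the BGEN import.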

hhx037 · October 18, 2018, 10:06am

Fantastic, thanks again
