Most efficient way to analyse a large dataset split by chromosome (like UK Biobank) in Hail 0.2?

hhx037 · October 16, 2018, 4:33pm

Sorry, me again

I’m wondering what the most efficient way is to run an analysis (logistic regression) on a dataset such as the UK Biobank, which is split by chromosome. Is a simple loop over the chromosomes the way to go, or is there a more efficient method? How did you proceed? (My resources are far more limited, with only 46 cores available.)

Any tips that could speed up the regression, saving the results, or exporting the p-values? (gwas.rows().cache()?)

tpoterba · October 16, 2018, 5:16pm

You can load many BGENs at once with import_bgen – this is probably the easiest thing.

You also won’t need the cache there, since this is a one-pass algorithm.
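A minimal sketch of what that could look like, passing a list of per-chromosome files to `import_bgen` and running the regression once over the combined MatrixTable. The path template, the sample file, the phenotype table `pheno_table` (assumed to be keyed by sample ID) and its fields `is_case` and `age` are all hypothetical placeholders, not something from this thread:

```python
def chromosome_paths(template, chromosomes=range(1, 23)):
    """Expand a template like 'data/ukb_chr{chrom}.bgen' into one path per chromosome."""
    return [template.format(chrom=c) for c in chromosomes]


def run_logreg(bgen_paths, sample_file, pheno_table, out_path):
    # Imported here so the path helper above works without Hail installed.
    import hail as hl

    # import_bgen needs an index per file; this only has to be done once.
    hl.index_bgen(bgen_paths)

    # A single MatrixTable spanning all chromosomes, no Python loop needed.
    mt = hl.import_bgen(bgen_paths,
                        entry_fields=['dosage'],
                        sample_file=sample_file)

    # Hypothetical phenotype table keyed by sample ID.
    mt = mt.annotate_cols(pheno=pheno_table[mt.s])

    gwas = hl.logistic_regression_rows(
        test='wald',
        y=mt.pheno.is_case,              # hypothetical boolean case/control field
        x=mt.dosage,
        covariates=[1.0, mt.pheno.age],  # 1.0 is the intercept; age is hypothetical
    )

    # One pass over the data, so no cache() before exporting.
    gwas.select('p_value').export(out_path)


bgen_paths = chromosome_paths('data/ukb_chr{chrom}.bgen')  # hypothetical layout
```

Calling `run_logreg(bgen_paths, 'data/ukb.sample', pheno_table, 'results/logreg.tsv.bgz')` would then run everything as one job, which lets Hail spread all chromosomes across the available cores rather than processing them one after another.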

hhx037 · October 16, 2018, 7:32pm

Great, thank you. Given the size of the data and the relatively limited resources I’ve got, I want to optimize as much as I can.

Would the multiple import also work for vcf files?

tpoterba · October 16, 2018, 7:58pm

Yes, it does!
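For illustration, a rough sketch of the VCF equivalent, again with a hypothetical path template and assuming all per-chromosome files share the same header and samples:

```python
def chromosome_paths(template, chromosomes=range(1, 23)):
    """Expand a template like 'data/chr{chrom}.vcf.bgz' into one path per chromosome."""
    return [template.format(chrom=c) for c in chromosomes]


def load_vcfs(vcf_paths, reference_genome='GRCh37'):
    # Imported here so the path helper above works without Hail installed.
    import hail as hl

    # import_vcf accepts a list of paths and returns a single MatrixTable.
    return hl.import_vcf(vcf_paths, reference_genome=reference_genome)


vcf_paths = chromosome_paths('data/chr{chrom}.vcf.bgz')  # hypothetical layout
```

`load_vcfs(vcf_paths)` gives one MatrixTable that can be passed to the regression in the same way as the BGEN import.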

hhx037 · October 18, 2018, 10:06am

Fantastic, thanks again
