Most efficient way to analyse a large dataset split by chromosome (like UK Biobank) in Hail 0.2?


#1

Sorry, me again :smiley:

I’m wondering what the most efficient way is to run a logistic regression on a dataset such as the UK Biobank, which is split by chromosome. Is a simple loop over the chromosomes the way to go, or is there a more efficient method? How did you proceed? (My resources are far more limited, with only 46 cores available.)

Any tips for speeding up the regression, saving the results, or exporting the p-values? (gwas.rows().cache() ?)


#2

You can load many BGENs at once with import_bgen – this is probably the easiest thing.

You also won’t need the cache there, since the regression is a one-pass algorithm: the data is read only once, so there is nothing to reuse.
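For illustration, here is a minimal sketch of what that could look like. The file names, the phenotype table and its columns (`eid`, `is_case`, `age`) are hypothetical placeholders, and `is_case` is assumed to be coded 0/1. The BGEN files have to be indexed once with `index_bgen` before `import_bgen` can read them; if the contig names in the files do not match the reference genome, a `contig_recoding` argument may also be needed.

```python
def chromosome_paths(template, chromosomes=range(1, 23)):
    """Expand a path template into one path per chromosome."""
    return [template.format(chrom=c) for c in chromosomes]


def run_gwas(bgen_template, sample_file, pheno_path, out_path):
    # Imported here so the path helper above can be used without Hail.
    import hail as hl

    paths = chromosome_paths(bgen_template)

    # One-off step: BGEN files must be indexed before they can be imported.
    hl.index_bgen(paths)

    # Passing a list of paths yields a single MatrixTable over all chromosomes.
    mt = hl.import_bgen(paths, entry_fields=['dosage'], sample_file=sample_file)

    # Hypothetical phenotype file with columns eid, is_case (0/1) and age.
    # eid is forced to a string so that it can be joined against mt.s.
    pheno = hl.import_table(pheno_path, impute=True,
                            types={'eid': hl.tstr}, key='eid')
    mt = mt.annotate_cols(pheno=pheno[mt.s])

    # The explicit 1.0 covariate is the intercept.
    gwas = hl.logistic_regression_rows(
        test='wald',
        y=mt.pheno.is_case,
        x=mt.dosage,
        covariates=[1.0, mt.pheno.age],
    )

    # Written straight to disk; the result table is not reused, so no cache().
    gwas.select('beta', 'p_value').export(out_path)
```

Something like `run_gwas('ukb_chr{chrom}.bgen', 'ukb.sample', 'pheno.tsv', 'gwas.tsv.bgz')` would then run the whole analysis as one job instead of 22 separate ones, which leaves the scheduling across your cores to Hail.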


#3

Great, thank you. Given the size of the data and the relatively limited resources I’ve got, I want to optimize as much as I can.

Would importing multiple files at once also work for VCF files?


#4

Yes, it does!
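A rough sketch of the VCF case, with hypothetical file names. `import_vcf` accepts a list of paths in the same way. Writing the result out as a MatrixTable first is optional, but parsing VCF text is slow, so it tends to pay off if you will read the data more than once.

```python
def vcf_paths(template, chromosomes=range(1, 23)):
    """Expand a path template into one VCF path per chromosome."""
    return [template.format(chrom=c) for c in chromosomes]


def load_vcfs(vcf_template, out_mt):
    # Imported here so the path helper above can be used without Hail.
    import hail as hl

    # A list of paths is imported as a single MatrixTable.
    # Block-gzipped files named *.vcf.gz additionally need force_bgz=True.
    mt = hl.import_vcf(vcf_paths(vcf_template))

    # Convert once to Hail's native format, then work from that copy.
    mt.write(out_mt, overwrite=True)
    return hl.read_matrix_table(out_mt)
```

For example, `load_vcfs('cohort_chr{chrom}.vcf.bgz', 'cohort.mt')` would return a MatrixTable covering chromosomes 1 to 22.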


#5

Fantastic, thanks again :slight_smile: