I am trying to use Hail to run a GWAS on UK Biobank data. The data is stored as one BGEN file per chromosome, each between 50 and 200 GB, so I am trying to convert them to a single Hail MatrixTable (.mt).
I am working on the Harvard FASRC cluster using 480 cores.
I am running the following command to load in the BGEN files and merge them (I previously tried default n_partitions but that was also running too slowly):
mts = hl.import_bgen(DIR+'ukb_imp_chr[1-22]_v3.bgen', entry_fields=['GP'], n_partitions=480)
mts.write('ukb_merged_bgen.mt', overwrite=True)
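A side note on the path pattern in that command: in glob syntax, `[1-22]` is normally read as a single-character class (the range `1-2` plus the character `2`), not as the numbers 1 through 22. The sketch below uses Python's `fnmatch` as a stand-in to illustrate this, on the assumption that the glob matcher Hail uses treats character classes the same way. The variable `bgen_paths` and the placeholder value of `DIR` are hypothetical, not taken from the actual job.

```python
from fnmatch import fnmatchcase

DIR = "/path/to/ukb/"  # placeholder; substitute the real directory

# File names for the 22 autosomes, written out explicitly.
names = [f"ukb_imp_chr{c}_v3.bgen" for c in range(1, 23)]

# Explicit list of full paths, avoiding any glob range.
# hl.import_bgen accepts a list of paths in place of a single pattern.
bgen_paths = [DIR + name for name in names]

# Check which file names the original pattern would match under fnmatch rules
# (assumption: the glob matcher used by Hail handles [..] the same way).
pattern = "ukb_imp_chr[1-22]_v3.bgen"
matched = [name for name in names if fnmatchcase(name, pattern)]

print(len(bgen_paths))
print(matched)
```

If the pattern behaves this way in the real job, only a subset of the chromosomes is being imported, so passing an explicit list such as `bgen_paths` to `hl.import_bgen` may be the safer choice.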
The job is running extremely slowly. After 32 hours it has only reached [Stage 0, (11 + 1) / 481], and with the default n_partitions it was at about 400/2955 after 3 days.
I have two questions:
- How can I run this job faster?
- What do the stages and numbers represent?
I have also attached the hail log files for this job.
Thank you so much!
hail-20210628-0658-0.2.67-a673309b0445.log (63.4 KB)