Importing large BGEN into Hail Matrix Table

Hi,

I am trying to use Hail to run a GWAS on UK Biobank data. The data comes as one BGEN file per chromosome, each between 50 and 200 GB, so I am trying to convert them into a single .mt file.

I am working on the Harvard FASRC cluster using 480 cores.
I am running the following commands to load the BGEN files and merge them (I previously tried the default n_partitions, but that was also too slow):

mts = hl.import_bgen(DIR + 'ukb_imp_chr[1-22]_v3.bgen', entry_fields=['GP'], n_partitions=480)
mts.write('ukb_merged_bgen.mt', overwrite=True)

The job is running extremely slowly. After 32 hours it is only at [Stage 0, (11 + 1) / 481], and with the default n_partitions it was only at about 400/2955 after 3 days.

I have 2 questions:

  1. How can I run this job faster?
  2. What do the stages and numbers represent?

I have also attached the hail log files for this job.

Thank you so much!
hail-20210628-0658-0.2.67-a673309b0445.log (63.4 KB)

In the above (11 + 1) / 481, there are 481 total “tasks” (parallelizable units of work); 11 are finished and 1 is in progress.

The first problem is that your pipeline is only running on a single CPU, not 480. Are you running on a Spark cluster with 480 cores? How is it configured?
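One quick way to check what Spark actually sees from inside your job (a minimal sketch; the values mentioned in the comments are just examples):

import hail as hl

hl.init()
sc = hl.spark_context()

# 'local[1]' means a single local thread; 'local[*]' or a spark:// URL
# means Spark can actually use the cores you requested.
print(sc.master)
print(sc.defaultParallelism)  # how many tasks Spark will run at once by default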

Once we solve the parallelism, there will be another issue: you probably shouldn’t write BGEN data directly to a matrix table. BGEN has an excellent representation and compression scheme for dosage data, and while we’re working on adding those representations and compression strategies as general-purpose infrastructure in Hail, we don’t have them yet. You should work directly from the BGEN files (see the sketch below): import is relatively speedy, and I think this query is slow because Hail is writing MUCH more data than the combined size of the BGENs.
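Roughly, that pattern looks like the following. This is only a sketch: the sample file path, the phenotype table, and the phenotype column name are placeholders for whatever you actually have, and DIR is the same directory variable as in your snippet.

import hail as hl

hl.init()

# One-time step: build index files next to the BGENs so import_bgen can read them.
hl.index_bgen(DIR + 'ukb_imp_chr1_v3.bgen', reference_genome='GRCh37')  # repeat per chromosome

# Import lazily -- no .write() to a matrix table.
mt = hl.import_bgen(
    DIR + 'ukb_imp_chr[1-22]_v3.bgen',
    entry_fields=['GP'],
    sample_file=DIR + 'ukb_imp_chr1_v3.sample',  # placeholder path
)

# Annotate phenotypes (placeholder table keyed by sample ID) and run the GWAS
# directly on the imported BGEN data.
pheno = hl.import_table('phenotypes.tsv', impute=True, key='IID',
                        types={'IID': hl.tstr})  # placeholder table
mt = mt.annotate_cols(pheno=pheno[mt.s])

results = hl.linear_regression_rows(
    y=mt.pheno.height,            # placeholder phenotype
    x=mt.GP[1] + 2 * mt.GP[2],    # expected dosage computed from GP
    covariates=[1.0],             # intercept; add PCs, sex, etc. as needed
)
results.write('gwas_results.ht', overwrite=True)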

Thank you so much! I reached out to the Harvard FASRC help email to see if they have an idea why the pipeline is only running on a single CPU. I will let you know if we have trouble figuring it out. Reading directly from the BGEN files should also help!

Thanks!
Lillian

Thanks again for your help.
I am running the job on the Harvard FASRC cluster. The cluster uses SLURM, through which I can request multiple cores on one or more nodes. I checked, and even with only one node it doesn’t seem to run tasks in parallel.

We have Hail installed in a Singularity container, which is built as follows:

singularity build --remote hail_0.2.67.sif hail_0.2.67.def

where hail_0.2.67.def is:

Bootstrap: docker
From: hailgenetics/hail:0.2.67

# enable hail plotting functionality

%post
    # ubuntu 20.04 phantomjs package broken; use static binary
    # https://askubuntu.com/a/1163268
    # https://github.com/openstreetmap/openstreetmap-website/issues/2569
    curl -L https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2 |
      tar -xjvf -
    mv phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/local/bin
    rm -rf phantomjs-2.1.1-linux-x86_64
    hail-pip-install --no-cache-dir phantomjs==1.4.1 selenium==3.141.0

I am not sure whether this Singularity image is set up correctly. Do you see anything wrong with it, or know how to make it run in parallel?

Thanks!

I talked with one of my coworkers who has a lot of Spark experience, and this is what we deduced.

We think Spark is most likely running in single-threaded local mode rather than on the cluster, since Hail works but nothing runs in parallel. Is there a Spark context we can specify at the beginning of the Python script to change the Spark master, or is there an option we can add to the Hail init?

This is my current init command:
hl.init(spark_conf={'spark.driver.memory': '100g'})
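Would something like the following be the right direction? This is just a sketch on our end; the 'local[48]' value and the core count are guesses based on what we request from SLURM, and I'm not sure whether master is the right argument to use here.

import hail as hl

# Guess: 'local[48]' asks Spark for 48 worker threads on the node SLURM allocates;
# a spark://host:port URL would instead point at an actual Spark cluster.
hl.init(
    master='local[48]',
    spark_conf={'spark.driver.memory': '100g'},
)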

Thanks again, and have a good long weekend!