Best way to handle multiple large BGEN files for GWAS

I’m doing GWAS using UK Biobank data. Each chromosome has its own BGEN file. To reduce the amount of computation, I am only loading certain variants from each BGEN file, and I have a separate file of variants for each chromosome.

Would it be more computationally efficient to run each chromosome separately and then combine the results at the end, or should I try to load all the BGEN files at once?

When I run each chromosome separately, it seems that a few partitions always lag at the end of the computation, resulting in inefficient use of computational resources.

Can the “import_bgen” function handle multiple BGEN files and multiple variant lists at the same time (i.e. using the n-th variant list to select certain loci from the n-th BGEN file)? I could combine all the variants into one list if necessary.

And if importing all the BGEN files at once would work, would the enormous size of the resulting matrix table cause problems?


Probably better to import everything with a single import_bgen call and one combined variant list. It doesn’t support a separate list per file, and I don’t think that would be any more efficient anyway.


One more question.

I’m trying to combine the variants into one file. I have multiple tables, one for each chromosome, all with exactly the same column names. I want to stack the tables on top of one another, the equivalent of “rbind” in R, but I can’t figure out how to do this in Hail. When I try table.join(), it keeps renaming some of the columns, with the following message:

Hail: INFO: Table.join: renamed the following fields on the right to avoid name conflicts:
‘rsid’ -> ‘rsid_1’
‘pos’ -> ‘pos_1’
‘ref’ -> ‘ref_1’
‘alt’ -> ‘alt_1’
‘maf’ -> ‘maf_1’
‘f6’ -> ‘f6_1’
‘info’ -> ‘info_1’

It seems like there should be a simple way to add one table on top of another, but I can’t seem to figure it out.

You’re thinking of Table.union. Join is a SQL-style join; it’s for connecting two related datasets by a shared set of fields.

As I run the large GWAS, I’m monitoring its progress in the Spark UI. Things seemed to be going well for the first hour, but now the rate at which tasks complete successfully is falling and the number of failed tasks is rising more quickly.

Active Jobs (1)
Job Id: 38
Description: runJob at SparkHadoopWriter.scala:78
Submitted: 2019/09/09 14:20:26
Duration: 1.9 h
Stages (Succeeded/Total): 1/2
Tasks for all stages (Succeeded/Total): 7669/30014 (12308 failed)

I’m looking at the number of failed tasks, and it has risen to 12,308 fairly quickly, and the number of “succeeded” tasks has started rising more slowly.

Any ideas what could be causing this? I don’t want to run a large cluster for too long, and progress seems to have slowed.

We’d need to see the Hail log file.

How do I access that file?


Use scp to copy it off the leader node of your cluster.

I apologize for my poor programming skills.

If my cluster is called moconnor and the master node is moconnor-m, how would the syntax look?

scp mark@mconnor-m:home/hail/hail-20190909-1354-0.2.18-08ec699f0fd4.log /home/mark/hail.log

It tells me that “Permission denied (publickey).”

something like:

gcloud compute scp mconnor-m:/home/hail/...log /local/path/here

Thanks! That works, but the log is 61 MB and I can’t really interpret what is going on. Maybe I should come to office hours on Thursday.

Yeah, the log is confusing. You can email it to us and we can look through it. Office hours would be good too.