Best way to handle multiple large BGEN files for GWAS

I’m doing GWAS using UK Biobank data. Each chromosome has its own BGEN file. To reduce the amount of computation, I am only loading certain variants from each BGEN file, and I have a separate file of variants for each chromosome.

Would it be more computationally efficient to run each chromosome separately and then combine the results at the end, or should I try to load all the BGEN files at once?

When I run each chromosome separately, it seems that a few partitions always lag at the end of the computation, resulting in inefficient use of computational resources.

Can the “import_bgen” function handle multiple BGEN files and multiple variant lists at the same time (i.e. using the n-th variant list to select certain loci from the n-th BGEN file)? I could combine all the variants into one list if necessary.

And if importing all the BGEN files at once would work, would the enormous size of the resulting matrix table cause problems?


Probably better to import everything with a single import_bgen call and one combined variant list. It doesn’t support a separate list per file, and I don’t think that would be any more efficient anyway.


One more question.

I’m trying to combine the variants into one file. I have multiple tables, one for each chromosome, all with exactly the same column names. I want to stack the tables on top of one another, the equivalent of “rbind” in R, but I can’t figure out how to do this in Hail. When I try table.join(), it keeps renaming some of the columns, with the following message:

Hail: INFO: Table.join: renamed the following fields on the right to avoid name conflicts:
‘rsid’ -> ‘rsid_1’
‘pos’ -> ‘pos_1’
‘ref’ -> ‘ref_1’
‘alt’ -> ‘alt_1’
‘maf’ -> ‘maf_1’
‘f6’ -> ‘f6_1’
‘info’ -> ‘info_1’

It seems like there should be a simple way to add one table on top of another, but I can’t seem to figure it out.

You’re thinking of Table.union. Join is a SQL-style join; it’s for connecting two related datasets by a shared set of fields.

As I run the large GWAS, I’m monitoring its progress in the Spark UI. Things seemed to be going well for the first hour, but now the rate at which tasks complete successfully is falling and the number of failed tasks is rising more quickly.

Active Jobs (1)
Job Id: 38
Description: runJob at SparkHadoopWriter.scala:78
Submitted: 2019/09/09 14:20:26
Duration: 1.9 h
Stages (Succeeded/Total): 1/2
Tasks for all stages (Succeeded/Total): 7669/30014 (12308 failed)

I’m looking at the number of failed tasks, and it has risen to 12,308 fairly quickly, and the number of “succeeded” tasks has started rising more slowly.

Any ideas what could be causing this? I don’t want to run a large cluster for too long, and progress seems to have slowed.

We’d need to see the Hail log file.

How do I access that file?


Use scp to copy it off the leader node of your cluster.

I apologize for my poor programming skills.

If my cluster is called moconnor and the master node is moconnor-m, how would the syntax look?

scp mark@mconnor-m:home/hail/hail-20190909-1354-0.2.18-08ec699f0fd4.log /home/mark/hail.log

It tells me that “Permission denied (publickey).”

something like:

gcloud compute scp mconnor-m:/home/hail/...log /local/path/here

Thanks! That works, but the log is 61 MB and I can’t really interpret what is going on. Maybe I should come to office hours on Thursday.

Yeah, the log is confusing. You can email it to us and we can look through it. Office hours would be good too.