I’m doing GWAS using UK Biobank data. Each chromosome has its own BGEN file. To reduce the amount of computation, I am only loading certain variants from each BGEN file, and I have a separate file of variants for each chromosome.
Would it be more computationally efficient to run each chromosome separately and then combine the results at the end, or should I try to load all the BGEN files at once?
When I run each chromosome separately, it seems that a few partitions always lag at the end of the computation, resulting in inefficient use of computational resources.
Can the “import_bgen” function handle multiple BGEN files and multiple variant lists at the same time (i.e. using the n-th variant list to select certain loci from the n-th BGEN file)? I could combine all the variants into one list if necessary.
And if importing all the BGEN files at once would work, would the enormous size of the resulting matrix table cause problems?
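For context, this is roughly my current per-chromosome workflow (a sketch only; the file paths and the name variants_chr{c}.ht are placeholders for however the per-chromosome variant tables are actually stored):

```python
import hail as hl

# Sketch of the current per-chromosome workflow (paths are placeholders).
# Each chromosome's variant table is assumed to be keyed by locus
# (or by locus and alleles), and each BGEN indexed beforehand with hl.index_bgen.
for c in range(1, 23):
    variants_ht = hl.read_table(f'variants_chr{c}.ht')   # hypothetical pre-built table
    mt = hl.import_bgen(
        f'ukb_imp_chr{c}_v3.bgen',
        entry_fields=['dosage'],
        sample_file='ukb_imp.sample',
        variants=variants_ht,
    )
    # ... run the GWAS for this chromosome and export the results
```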
Probably better to import everything with one import_bgen call and one big variant list (it doesn’t support a separate list per file, and I don’t think that would be any more efficient anyway).
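A minimal sketch of that suggestion, assuming the per-chromosome variant tables have already been combined into a single table keyed by locus (and alleles), and that the BGEN files have been indexed with hl.index_bgen; the paths and the name all_variants_ht are placeholders:

```python
import hail as hl

# One import_bgen call over all chromosomes: a list of BGEN paths plus a
# single combined variant table keyed by locus (and alleles).
bgen_paths = [f'ukb_imp_chr{c}_v3.bgen' for c in range(1, 23)]
all_variants_ht = hl.read_table('all_variants.ht')   # hypothetical combined table

mt = hl.import_bgen(
    bgen_paths,
    entry_fields=['dosage'],
    sample_file='ukb_imp.sample',
    variants=all_variants_ht,
)
```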
I’m trying to combine the variants into one file. I have multiple tables, one for each chromosome, and they all have exactly the same column names. I want to combine the tables by stacking one on top of the other, the equivalent of “rbind” in R, but I can’t figure out how to do this in Hail. When I try to use table.join(), it keeps renaming some of the columns and reports the following:
Hail: INFO: Table.join: renamed the following fields on the right to avoid name conflicts:
‘rsid’ -> ‘rsid_1’
‘pos’ -> ‘pos_1’
‘ref’ -> ‘ref_1’
‘alt’ -> ‘alt_1’
‘maf’ -> ‘maf_1’
‘f6’ -> ‘f6_1’
‘info’ -> ‘info_1’
It seems like there should be a simple way to stack one table on top of another, but I haven’t been able to find it.
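For reference, a minimal sketch of what I tried (the field names are from my actual tables; the ‘variant’ key field and file paths are illustrative). The join call below is what produces the renaming message above:

```python
import hail as hl

# Two of the per-chromosome tables; both have identical fields
# (rsid, pos, ref, alt, maf, f6, info). The 'variant' key and paths
# are illustrative placeholders.
ht_chr1 = hl.import_table('variants_chr1.tsv', impute=True).key_by('variant')
ht_chr2 = hl.import_table('variants_chr2.tsv', impute=True).key_by('variant')

# This is a keyed join (it adds the right table's fields as extra columns),
# which is why Hail renames the overlapping fields with a `_1` suffix,
# rather than stacking the rows the way rbind would.
combined = ht_chr1.join(ht_chr2)
```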
As the large GWAS runs, I’m monitoring its progress in the Spark UI. Things seemed to be going well for the first hour, but now the rate at which tasks complete successfully is falling and the number of failed tasks is rising more quickly.
Active Jobs (1)
Job Id: 38
Description: runJob at SparkHadoopWriter.scala:78
Submitted: 2019/09/09 14:20:26
Duration: 1.9 h
Stages (Succeeded/Total): 1/2
Tasks for all stages (Succeeded/Total): 7669/30014 (12308 failed)
The number of failed tasks has climbed to 12,308 fairly quickly, while the number of succeeded tasks is now rising more slowly.
Any ideas what could be causing this? I don’t want to run a large cluster for too long, and progress seems to have slowed.