GWAS on subset of UKBioBank

Hello, I’m using this thread as a starting point.

The above workflow shows how to run a GWAS on one chromosome using the UK Biobank BGEN file.

My ultimate goal is to run a full GWAS on 79,800 phenotypic measures, in this case functional connectivity from resting-state MRI. I have a subset of 11,533 subjects that I’d like to run this on. I could use some help in 1) setting up this analysis and 2) optimizing it.

To start, I’d like to get a GWAS running on one pheno measure, which I could then parallelize 79,800 times.

The below code is from the above thread, which I’m using as a starting point, but I’ve changed it a bit for my own application.

import hail as hl
import pandas as pd
import os, sys
bgenFile = 'ukb_imp_chr1_v3.bgen'
sampleFile = 'ukb22875_imp_chr1_22_v3_s487320.sample'
mfiFile = 'ukb_mfi_chr1_v3.txt'
idpsFile = 'ALL_IDPs_i_deconf.csv'
subjsFile = 'good_subjs.csv'
# initiate hail environment
hl.init()

# import pheno table
pheno = hl.import_table(idpsFile, delimiter=',')

My idpsFile is a 11,533 x 79,800 matrix csv (subjects by pheno measures). Only data, no column or row names. Do I need to include column and row names for indexing in the data file, or is this something I can annotate using Hail?

Also, what is best practice for running a GWAS on all chromosomes? Run a regression separately for each? If not, how would I prepare the data for an all-chromosome run?

I’ll likely have more questions, but let’s start there. Thanks in advance for your help!

The rough strategy that will probably work best here is:

  1. prepare a matrix table with genotypes, and a matrix table with phenotypes (orientation of this second matrix doesn’t really matter)
  2. Using some batch size B, annotate B phenotypes from the phenotypes matrix onto the genotype matrix columns, then run linear_regression_rows against all B phenotypes together. There are optimizations inside of Hail’s linear_regression_rows when running on multiple phenotypes at the same time – we can use BLAS3 operations instead of BLAS2, which is much more efficient. Finally, write the linear regression results from this batch to a file.
  3. Repeat #2 for all batches.
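For concreteness, the batch bookkeeping in steps 2 and 3 can be sketched in plain Python (the batch size of 100 is a hypothetical choice, not a recommendation):

```python
# partition 79,800 phenotype indices into half-open [start, end) batches
n_phenos = 79_800
batch_size = 100  # hypothetical choice; tune for memory and speed

n_batches = (n_phenos + batch_size - 1) // batch_size  # ceiling division
batches = [(b * batch_size, min((b + 1) * batch_size, n_phenos))
           for b in range(n_batches)]

# every phenotype index lands in exactly one batch
assert sum(end - start for start, end in batches) == n_phenos
```

Each `(start, end)` pair then drives one filter-annotate-regress-write round trip.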

Some specifics:

preparing phenotype matrix

You can absolutely use Hail for this. Something like:

pheno_mt = hl.import_matrix_table(idpsFile, no_header=True)

Running a batch will look something like:

batch_size = ...
batch_index = ...

phenos_mt = hl.read_matrix_table('')

pheno_start = batch_size * batch_index
pheno_end = pheno_start + batch_size

# rows are samples, cols are phenotypes, right?
phenos_mt = phenos_mt.filter_cols((phenos_mt.col_id >= pheno_start) & (phenos_mt.col_id < pheno_end))
phenos_table = phenos_mt.annotate_rows(phenos = hl.agg.collect(phenos_mt.x)).rows()

genos_mt = hl.read_matrix_table('')
genos_mt = genos_mt.add_col_index() # we are joining by index, right?
genos_mt = genos_mt.annotate_cols(phenos = phenos_table[genos_mt.col_idx].phenos)

batch_results = hl.linear_regression_rows(
    y=[genos_mt.phenos[i] for i in range(batch_size)],
    x=genos_mt.dosage,
    covariates=[1.0],  # intercept
    block_size=128)  # can be larger than the default 16 because there are only ~11k samples

a caveat

This approach relies on the phenotypes being non-missing. If phenotypes are sometimes missing, you’ll need a slightly different strategy – the batched linear regression runs regression within a group on the set of samples that are non-missing for every phenotype and every covariate. If you have phenotypes with disparate missingness patterns, you can end up with no samples in that set!
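A toy plain-Python illustration of the caveat (not Hail; the values are made up):

```python
# two phenotypes with complementary missingness patterns
pheno_a = [1.0, None, 2.0, None]  # missing for samples 1 and 3
pheno_b = [None, 3.0, None, 4.0]  # missing for samples 0 and 2

# batched regression keeps only the samples that are non-missing
# for EVERY phenotype (and covariate) in the batch
complete = [i for i in range(4)
            if pheno_a[i] is not None and pheno_b[i] is not None]

# complete is empty: batching these two phenotypes together leaves no samples
```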

Thank you so much for your quick reply! Your approach makes sense to me, but I have some follow-up questions.

Yes, rows are samples and columns are phenotypes.

I think we are joining by column index, but I’m still a little hazy on how the matrix table is organized after importing the bgen file.

I’m reading in the bgen file:

genotype_info = hl.import_bgen(path=bgenFile, entry_fields=['dosage'], sample_file=sampleFile)

I have a column field of s, which I think is the sample (subject ID). If this is true then yes, I want to merge pheno_mt and genos_mt by the s column. Do I need to filter genos_mt by the subject IDs before merging? If so, how do I do this?

Also, how do I prepare a matrix table from multiple bgen files?

Luckily the phenotype data have undergone extensive deconfounding and cleaning. So I can be sure that there are no missing data in the pheno file and I have no covariates since I have regressed them out prior to the gwas.

Ah, I see. You should do this before any of the batches, and write out a matrix table with just the samples of interest – then you don’t waste time reading and discarding the 490k others.

import_bgen takes a list of BGEN file paths (I assume these are split by chromosome or something)

To answer the join question, do you have some mapping from sample ID in the bgen to position in the phenos matrix? Using that to do the join rather than the column index of the filtered matrix table is going to be much better.
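To illustrate why joining by sample ID beats joining by position (plain Python, hypothetical IDs): if the BGEN column order differs from the pheno-file row order, positional alignment silently pairs the wrong data, while an ID-keyed lookup stays correct.

```python
# pheno file rows, in pheno-file order (hypothetical sample IDs, toy values)
pheno_ids = ['1001', '1007', '1042']
pheno_rows = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
row_of = {s: i for i, s in enumerate(pheno_ids)}  # sample ID -> pheno row index

# BGEN columns arrive in a different order
bgen_ids = ['1042', '1001', '1007']

# correct: look up each BGEN sample's phenotypes by ID
aligned = [pheno_rows[row_of[s]] for s in bgen_ids]

# positional alignment would instead pair sample '1042' with pheno row 0
```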

Yes, I have a list of subject IDs. The IDs correspond to rows in the pheno file and entries in the s field of the bgen column matrix table. So I was planning on something like:

subjs = ... # list of subject IDs, as strings
genotype_info = hl.import_bgen(path=bgenFile, entry_fields=['dosage'], sample_file=sampleFile)

Yes, they are split by chromosome. I didn’t realize import_bgen could take multiple files. Cool. I’ll look into this and also how to index the full matrix table so that I can keep track of chromosomes downstream.

Hail syntax for this is:

genotype_info_subset = genotype_info.filter_cols(hl.literal(subjs).contains(genotype_info.s))


Hi there. I’ve read in all of the bgen files and subset them using the above command. Now I just want to count the number of samples in the subset to make sure that I have all of the samples.

However, counting the subset matrix table is taking a really long time.

genotype_info = hl.import_bgen(bgen22Files, entry_fields=['dosage'], sample_file=sampleFile)
genotype_info_subset = genotype_info.filter_cols(hl.literal(subjs).contains(genotype_info.s))

This works very quickly:

genotype_info.count()

This is taking a really really long time:

genotype_info_subset.count()
Should I be overwriting genotype_info? Maybe Hail did some indexing when I created the matrix table and it needs to reindex after I subset?

I’m reading in my idpsFile and I’m getting:

2021-04-08 11:37:31 Hail: INFO: Ordering unsorted dataset with network shuffle

What does this mean? The idpsFile is a 11,533 x 79,800 csv of floats. Here is an example entry: -0.820223875951066

It is essential that I keep the same ordering of this matrix. The rows and columns mean something very specific (rows are subjects, columns are brain functional connectivity edges). Please let me know if there is a better way to do this. Here is how I was thinking of setting this up:

  • 11,533 samples. I have a text file of these samples (subject IDs) that I read in so that I can subset genotype_info by s. The rows in idpsFile correspond to the order of the sample text file (how would I “tag” the idpsFile matrix table so that the rows are associated with the sample IDs)?
  • I read in the 22 chromosomes and subset to only include my samples (see my post above)
  • Then I parallelize 79,800 GWASs, using the columns of the idpsFile to regress onto the snps.
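On the “tag” question in the first bullet: since the pheno rows follow the sample-file order, the association is purely positional. In plain Python (toy values, not Hail, where this would be a row annotation) it amounts to:

```python
# sample IDs in the order they appear in the sample text file (hypothetical)
sample_ids = ['1001', '1007', '1042']
# pheno matrix rows, in the same order (toy values)
pheno_rows = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# tag each pheno row with its sample ID by position
tagged = dict(zip(sample_ids, pheno_rows))
```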

This is probably a Hail performance bug wherein it doesn’t realize it need not look at the genotype data. genotype_info_subset.count_cols() should be fast.

Does just the import-and-write operation cause this error? Concretely, does running only this show that message? If it does, that’s a Hail bug. Can you share a simplified file that triggers this?

import hail as hl
pheno_mt = hl.import_matrix_table(idpsFile, no_header=True)

Thanks for your response.

Yes, count_cols() is much faster. Thank you.

Yes, just reading and writing causes the error. However, I also have to specify the dtype. If I do not I get:

Error summary: MatrixParseError: file:/data/Ukbb_DSST/DSST/dmoracze/gwas/head_idps.csv:2-3, invalid character '.' in integer literal

If I try:

pheno_mt = hl.import_matrix_table(idpsFile, no_header=True, entry_type=hl.tstr)

I get a warning, but it doesn’t error out. However, the data are not strings, they are floats (e.g. -0.820223875951066), and if I try entry_type=hl.tfloat64 I get a really long error beginning with:

FatalError: NumberFormatException: For input string:

In bash, I used head to grab the top of the file to see if it would recreate the error. It doesn’t, but it gives me:

2021-04-08 14:40:52 Hail: INFO: Coerced sorted dataset

The full file is 16GB in size. Do you think if I rounded those floats Hail would be happier?
I have attached the first 10 rows and 10 columns from the full idpsFile here.

samp.txt (1.8 KB)

This command should not trigger that message because it is not an action (write, show, collect, etc.). Some subsequent operation is causing that message. What is the full series of commands that triggers that error? What version of Hail are you using? Latest is currently 0.2.64.

Can you try this instead?

import hail as hl
mt = hl.import_matrix_table('samp.txt', 

If you get the number format error again, paste the whole thing here or at some paste service and share the link, and I can dig in further.

EDIT: Do you have missing values? How are they represented? hl.import_matrix_table uses the string NA by default, but you can override that.

I’m using Hail 0.2.45.

Sorry, I wasn’t clear in the above message and it appears that I get the warning when I write the matrix table, not read it.

import hail as hl
mt = hl.import_matrix_table('samp.txt', 

The import itself runs with no errors or warnings, but acting on the result gives me a whole host of errors. This is the bottom; I can paste the hundreds of lines in the error if you want.

Hail version: 0.2.45-a45a43f21e83
Error summary: SparkException: Failed to get broadcast_5_piece0 of broadcast_5

When I write the matrix table, I get the warning:


2021-04-08 15:29:41 Hail: INFO: Coerced sorted dataset
2021-04-08 15:29:41 Hail: INFO: wrote matrix table with 10 rows and 10 columns in 1 partition to

You should update to Hail 0.2.64. You’re missing out on 9 extra months of development; it’s possible things have improved and some of your problems have been addressed.

Oof! That’s almost a year old. Would you mind updating to the latest version of Hail and trying again?

The broadcast issue is perplexing, I’d need the complete stack trace to address that. It looks likely to be some spark configuration issue.

EDIT: it looks like the write was successful anyway.

Thanks. I need to put this project down for today. I’ll update and try tomorrow.

Yes, I just want to make sure that the order in which it was written was the order in which it was read, because the row order is tied to the samples. So I wanted to know whether any reordering or sorting was happening.

I’m quite confident this snippet changes neither the row nor column ordering:

import hail as hl
mt = hl.import_matrix_table('samp.txt', 

If it does, that’d be a serious error in Hail that I’d like to investigate further.

Hello - I’m trying to incorporate the method outlined by @tpoterba above:

  • Save pheno data as a matrix table
  • Read in genotype data
  • Subset Genotype data, save as matrix table
  • merge pheno with the subsetted genotype data

However saving the subset genotype data as a matrix table is taking a really long time.

genotype_info = hl.import_bgen(bgen22Files, entry_fields=['dosage'], sample_file=sampleFile)
genotype_info_subset = genotype_info.filter_cols(hl.literal(subjs).contains(genotype_info.s))

bgen22Files is a list of all the bgen files for the 22 chromosomes. I’m subsetting by subjs which is a list of 11,533 sample IDs. The genotype_info_subset.write('') call has been running for almost 24 hours.

Is it necessary to save the subset genotype data as a matrix table? Or should I just merge the pheno and the geno data and move on, not saving it?

Another question that I’m going to have: how should I optimize the GWAS to run on my HPC? I’ve got 79,800 GWASes to run. Should I manually distribute these jobs using our job scheduler, or is @tpoterba’s suggestion of working in batches the most efficient way to go about this?

The UKB BGEN files are very large, and selecting out 11k of 500k columns is a time-consuming operation. It is not necessary to save the genotype data as a matrix table. BGEN is an efficient file format; I recommend using it directly.

I don’t think there’s any optimization that you can do. Hail could be faster (we’re working on it!), but Hail was primarily designed to scale, on a Spark cluster, to many cores. That’s how most folks use Hail on the UKB: they use a very large number of cores for a short period of time to quickly finish an analysis. Because Hail scales well, the cost is roughly the same whether you use 1 core for 10,000 hours or 10,000 cores for 1 hour.
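The cost claim above is just core-hour arithmetic, sketched trivially:

```python
# 1 core for 10,000 hours vs 10,000 cores for 1 hour
serial_core_hours = 1 * 10_000
parallel_core_hours = 10_000 * 1
# identical total core-hours, hence roughly identical cost on elastic infrastructure
```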

EDIT: 79,800 GWASes on every variant in the BGEN for 11,000 samples is a tremendous amount of computation. The Neale Lab ran ~2k GWASes, on every variant, on 337,000 individuals, and it took a couple of days with thousands of cores. The analysis you want to do is certainly possible, but I think only in an environment where you can lease thousands of cores for a couple of days.
