GWAS on subset of UKBioBank

Hail syntax for this is:

genotype_info.filter_cols(hl.literal(subjs).contains(genotype_info.s))

Hi there. I’ve read in all of the bgen files and subset them using the above command. Now I just want to count the number of samples in the subset to make sure that I have all of the samples.

However, counting the subset matrix table is taking a really long time.

genotype_info = hl.import_bgen(bgen22Files, entry_fields=['dosage'], sample_file=sampleFile)
genotype_info_subset = genotype_info.filter_cols(hl.literal(subjs).contains(genotype_info.s))

This works very quickly:

genotype_info.count()

This is taking a really really long time:

genotype_info_subset.count()

Should I be overwriting genotype_info? Maybe Hail did some indexing when I created the matrix table and it needs to reindex after I subset?

I’m reading in my idpsFile and I’m getting:

2021-04-08 11:37:31 Hail: INFO: Ordering unsorted dataset with network shuffle

What does this mean? The idpsFile is an 11,533 x 79,800 CSV of floats. Here is an example entry: -0.820223875951066

It is essential that I keep the same ordering of this matrix. The rows and columns mean something very specific (rows are subjects, columns are brain functional connectivity edges). Please let me know if there is a better way to do this. Here is how I was thinking of setting this up:

  • 11,533 samples. I have a text file of these samples (subject IDs) that I read in so that I can subset genotype_info by s. The rows in idpsFile correspond to the order of the sample text file (how would I “tag” the idpsFile matrix table so that the rows are associated with the sample IDs)?
  • I read in the 22 chromosomes and subset to only include my samples (see my post above)
  • Then I parallelize 79,800 GWASes, using the columns of the idpsFile to regress onto the SNPs.

This is probably a Hail performance bug wherein it doesn’t realize it need not look at the genotype data. genotype_info_subset.count_cols() should be fast.

Does just the import-and-write operation cause this error? Concretely, does running only this show that message? If it does, that’s a Hail bug. Can you share a simplified file that triggers this?

import hail as hl
pheno_mt = hl.import_matrix_table(idpsFile, no_header=True)
pheno_mt.write('some_path.mt')

Thanks for your response.

Yes, count_cols() is much faster. Thank you.

Yes, just reading and writing causes the error. However, I also have to specify the dtype. If I do not I get:

Error summary: MatrixParseError: file:/data/Ukbb_DSST/DSST/dmoracze/gwas/head_idps.csv:2-3, invalid character '.' in integer literal

If I try:

pheno_mt = hl.import_matrix_table(idpsFile, no_header=True, entry_type=hl.tstr)

it reads in without erroring out. However, the data are not strings, they are floats (e.g. -0.820223875951066), and if I try entry_type=hl.tfloat64 I get a really long error beginning with:

FatalError: NumberFormatException: For input string:

In bash, I used head to grab the top of the file to see if it would recreate the error. It doesn't, but it gives me:

2021-04-08 14:40:52 Hail: INFO: Coerced sorted dataset

The full file is 16GB in size. Do you think if I rounded those floats Hail would be happier?
I have attached the first 10 rows and 10 columns from the full idpsFile here.

samp.txt (1.8 KB)

This command should not trigger that message because it is not an action (write, show, collect, etc.). Some subsequent operation is causing that message. What is the full series of commands that triggers that error? What version of Hail are you using? Latest is currently 0.2.64.
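To illustrate the transformation/action distinction, here is a minimal sketch (the file path and the filter are hypothetical, not taken from your pipeline):

import hail as hl

# Transformations are lazy: they only build up a pipeline and log nothing.
mt = hl.import_matrix_table('pheno.csv', no_header=True,
                            entry_type=hl.tfloat, delimiter=',')
mt = mt.filter_rows(mt.row_id < 100)  # still lazy, still no work done
# Actions (write, show, collect, ...) run the whole pipeline; that is when
# messages like the shuffle/coerce INFO lines can appear.
mt.write('pheno.mt')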

Can you try this instead?

import hail as hl
mt = hl.import_matrix_table('samp.txt', 
                            no_header=True,
                            entry_type=hl.tfloat,
                            delimiter=',')
mt.show()

If you get the number format error again, paste the whole thing here or at a paste service (like gist.github.com) and share the link, and I can dig in further.

EDIT: Do you have missing values? How are they represented? hl.import_matrix_table uses the string NA by default, but you can override that.
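For example, something like this (a sketch; the empty-string token is only a guess at how missing values might be encoded in your file):

import hail as hl

# Override the missing-value string if the file uses something other than
# Hail's default of 'NA' (here, assume an empty field means missing).
mt = hl.import_matrix_table('samp.txt',
                            no_header=True,
                            entry_type=hl.tfloat,
                            delimiter=',',
                            missing='')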

I’m using Hail 0.2.45.

Sorry, I wasn't clear in the above message; it appears that I get the warning when I write the matrix table, not when I read it.

import hail as hl
mt = hl.import_matrix_table('samp.txt', 
                            no_header=True,
                            entry_type=hl.tfloat,
                            delimiter=',')

This works with no errors or warnings, but mt.show() gives me a whole host of errors. This is the bottom of the stack trace; I can paste the hundreds of lines of the error if you want.

Hail version: 0.2.45-a45a43f21e83
Error summary: SparkException: Failed to get broadcast_5_piece0 of broadcast_5

When I write the matrix table, I get the warning:

mt.write('samp.mt')

2021-04-08 15:29:41 Hail: INFO: Coerced sorted dataset
2021-04-08 15:29:41 Hail: INFO: wrote matrix table with 10 rows and 10 columns in 1 partition to samp.mt

You should update to Hail 0.2.64. You're missing out on nine extra months of development; it's possible things have improved and some of your problems have already been addressed.

Oof! That’s almost a year old. Would you mind updating to the latest version of Hail and trying again?

The broadcast issue is perplexing; I'd need the complete stack trace to address that. It looks likely to be a Spark configuration issue.

EDIT: it looks like the write was successful anyway.

Thanks. I need to put this project down for today. I’ll update and try tomorrow.

Yes, I just want to make sure that the order in which the matrix was written is the order in which it was read, because the row order is tied to the samples. So I wanted to know whether any ordering or sorting was happening.

I’m quite confident this snippet changes neither the row nor column ordering:

import hail as hl
mt = hl.import_matrix_table('samp.txt', 
                            no_header=True,
                            entry_type=hl.tfloat,
                            delimiter=',')
mt.write('samp.mt')

If it does, that’d be a serious error in Hail that I’d like to investigate further.

Hello - I’m trying to incorporate the method outlined by @tpoterba above:

  • Save pheno data as a matrix table
  • Read in genotype data
  • Subset genotype data, save as matrix table
  • Merge pheno with subsetted geno

However, saving the subset genotype data as a matrix table is taking a really long time.

genotype_info = hl.import_bgen(bgen22Files, entry_fields=['dosage'], sample_file=sampleFile)
genotype_info_subset = genotype_info.filter_cols(hl.literal(subjs).contains(genotype_info.s))
genotype_info_subset.write('geno_subset.mt')

bgen22Files is a list of all the bgen files for the 22 chromosomes. I’m subsetting by subjs which is a list of 11,533 sample IDs. The genotype_info_subset.write('geno_subset.mt') call has been running for almost 24 hours.

Is it necessary to save the subset genotype data as a matrix table? Or should I just merge the pheno and the geno data and move on, not saving it?

Another question that I'm going to have: how should I optimize the GWAS to run on my HPC? I've got 79,800 GWASes to run. Should I manually distribute these jobs using our job scheduler, or is @tpoterba's suggestion of working in batches the most efficient way to go about this?

The UKB BGEN files are very large, and selecting out 11k of 500k columns is a time-consuming operation. It is not necessary to save the genotype data as a matrix table. BGEN is an efficient file format; I recommend using it directly.
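In other words, something like this sketch at the start of each job, reusing the names from your earlier posts rather than reading a written-out subset (importing and filtering are lazy; the cost is paid when the downstream regressions actually run):

# Re-import and filter the BGENs directly instead of first writing a
# subsetted matrix table to disk.
genotype_info = hl.import_bgen(bgen22Files, entry_fields=['dosage'], sample_file=sampleFile)
genotype_info_subset = genotype_info.filter_cols(
    hl.literal(subjs).contains(genotype_info.s))
# ...join phenotypes and run the regressions on genotype_info_subset directly.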

I don’t think there’s any optimization that you can do. Hail could be faster (we’re working on it!), but Hail was primarily designed to scale, on a Spark cluster, to many cores. That’s how most folks use Hail on the UKB: they use a very large number of cores for a short period of time to quickly finish an analysis. Because Hail scales well, the cost is roughly the same whether you use 1 core for 10,000 hours or 10,000 cores for 1 hour.

EDIT: 79,800 GWASes on every variant in the BGEN for 11,000 individuals is a tremendous amount of computation. The Neale Lab ran ~2k GWASes, on every variant, on 337,000 individuals, and it took a couple of days with thousands of cores. The analysis you want to do is certainly possible, but I think only in an environment where you can lease thousands of cores for a couple of days.


Thanks for the info! I'm not worried too much about compute time; I just want to make sure I'm preparing in the most efficient way possible. I'll be working on the Biowulf cluster at the NIH.

EDIT: Once I know the most efficient way to use Hail, then I’ll benchmark the jobs and I can work with the cluster support staff to efficiently request resources.

I’m having an issue in my process of subsetting and merging the genotype and phenotype data.

# phenotype
# 11533 x 79800 csv
pheno_mt = hl.import_matrix_table(idpsFile, no_header=True, entry_type=hl.tfloat, delimiter=',')
subjs # list of 11533 strings of subject IDs (samples)

# genotype
genotype_info = hl.import_bgen(bgen22Files, entry_fields=['dosage'], sample_file=sampleFile)
genotype_info_subset = genotype_info.filter_cols(hl.literal(subjs).contains(genotype_info.s))

When I subset genotype_info based on subjs, only 11224 samples are returned. I take it this means I don’t have genotype data for some of the subjects I have phenotype data for. That’s fine.

However, I want to merge these by sample / s / subjs. Since subjs and pheno_mt are separate I’m not sure how to perform this merge. Ideally I’d have the row_key for pheno_mt be subjs.

I've tried to use genotype_info_subset.s to subset subjs:

subjs[genotype_info_subset.s]
TypeError: list indices must be integers or slices, not StringExpression

I’ve tried to annotate columns of pheno_mt with subjs but this puts the entire subject list at each row. And I’ve tried to replace the row key of pheno_mt with pheno_mt.key_cols_by(subjs):

TypeError: key_cols_by: parameter '*keys' (arg 0 of 1): expected (str or hail.expr.expressions.base_expression.Expression), found list:
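One possible way to get there (a sketch, not something confirmed in this thread: it assumes import_matrix_table(no_header=True) keys the rows by an integer row_id field, and that subjs is in exactly the same order as the rows of idpsFile):

# Attach sample IDs to pheno_mt by row position, then key the rows by them.
subjs_array = hl.literal(subjs)  # 11,533 sample IDs, in idpsFile row order
pheno_mt = pheno_mt.annotate_rows(s=subjs_array[hl.int32(pheno_mt.row_id)])
pheno_mt = pheno_mt.key_rows_by('s')

After this, pheno_mt can be joined to the genotype data by the sample ID s.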

@danking since I want to run 79,800 separate GWASes, distributed across our cluster, which approach do you think is more efficient?

  1. Merge the geno matrix table with the full pheno table, then write a script to run a GWAS on a single annotation (e.g. a single column from the pheno table)

or

  2. Write a script that subsets the pheno table for each iteration before merging pheno and geno, then run the GWAS.

Either way I’ll need a way to preserve the pheno column index, so I’ll likely write a script that takes a column index as an argument.

I think you want option 3 –

  • Divide phenotypes into groups of BATCH_SIZE (try 16 or 32 to start).
  • For each batch, read pheno table, select phenos in current batch, annotate_cols to join onto geno MT.
  • Run linear_regression_rows on the batch of phenotypes at the same time, which can take advantage of BLAS3 routines that save a lot of time over multiple calls to BLAS2 routines.
  • Export the results for this batch to a file (a rough sketch of this loop follows below).
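Here is a rough, untested sketch of what that loop could look like. It assumes pheno_mt has its rows keyed by the sample ID s (as sketched earlier), that its entry field is the default x and its column key the default integer col_id from import_matrix_table(no_header=True), and that genotype_info_subset is the filtered BGEN MatrixTable from earlier in the thread:

import hail as hl

batch_size = 32     # try 16 or 32 to start
n_phenos = 79_800

for start in range(0, n_phenos, batch_size):
    batch = list(range(start, min(start + batch_size, n_phenos)))

    # Keep only this batch of phenotype columns and collect them into a
    # per-sample array (one array element per phenotype in the batch).
    batch_mt = pheno_mt.filter_cols(hl.literal(batch).contains(pheno_mt.col_id))
    batch_phenos = batch_mt.annotate_rows(phenos=hl.agg.collect(batch_mt.x)).rows()

    # Join the batch of phenotypes onto the genotype columns by sample ID.
    geno_batch = genotype_info_subset.annotate_cols(
        phenos=batch_phenos[genotype_info_subset.s].phenos)

    # One multi-phenotype regression call per batch (the BLAS3 benefit above).
    results = hl.linear_regression_rows(
        [geno_batch.phenos[i] for i in range(len(batch))],
        geno_batch.dosage,
        covariates=[1.0],
        block_size=128)

    # Export this batch's results; the i-th response corresponds to batch[i],
    # so the original phenotype column index is preserved.
    results.export(f'gwas_results_batch_{start}.tsv.bgz')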

Ah good call, thanks.
You suggested this earlier in the thread, sorry to make you repeat yourself!

@tpoterba your method seems to be working well, though saving the results takes a really really long time. Is that to be expected?

batch_results = hl.linear_regression_rows(
    [geno_subset_mt.phenos[i] for i in range(batch_size)],
    geno_subset_mt.dosage,
    covariates=[1.0],
    block_size=128)

batch_results.write('test') # takes a really long time

Yes, absolutely! Hail’s interfaces are lazy; no computational work is happening until your write line. The time for the write includes the time for the regressions (which is the majority of the work).
