I am trying to run a GWAS on UK Biobank data, as it seems many folks here are doing.
From UKB, I have a unique .bgen file for each chromosome, as well as .sample files (same for CHR 1-22), and unique .bgen.bgi index files for each chromosome.
When I try to import them, I get the error “FatalError: HailException: The following BGEN files have no .idx2 index file. Use ‘index_bgen’ to create the index file once before calling ‘import_bgen’”
Do I need to run index_bgen() for every chromosome before I make MatrixTables, or can I use the .bgi files I already have? I thought the “index_file_map” optional parameter might be helpful, but the documentation says the files must have an .idx2 extension. Does that mean it is not possible to use my index files, and I should just generate new ones?
Furthermore, if I am eventually to create a MatrixTable containing all contigs, can I use Hadoop glob patterns for the bgen and sample files in both index_bgen() and import_bgen()? Or do I need individual index_bgen() runs for each chromosome? Thanks!
No, you can't reuse the .bgi files. Hail uses its own index format (.idx2) that enables more powerful operations, so you do need to generate new indexes with index_bgen.
You can use Hadoop glob patterns in both index_bgen and import_bgen. People often just construct a list of the files instead, which ensures they appear in the normal order: 1, 2, …, 22, X, Y, MT. If they’re not in the correct order, Hail has to (automatically) do a little bit of work to fix that.
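If the paths come from a glob or a directory listing rather than a hand-built list, something like the following could put them in that order first. This is only a sketch: `contig_sort_key` is a made-up helper, not part of Hail, and it assumes the file names contain `chr<contig>_` as in the UKB naming.

```python
import re

# Canonical contig order: 1..22, X, Y, MT.
CONTIG_ORDER = [str(i) for i in range(1, 23)] + ['X', 'Y', 'MT']
CONTIG_RANK = {contig: rank for rank, contig in enumerate(CONTIG_ORDER)}

def contig_sort_key(path):
    """Rank a path by the contig in its file name (assumes '...chr<contig>_...')."""
    match = re.search(r'chr([0-9]+|X|Y|MT)_', path)
    if match is None or match.group(1) not in CONTIG_RANK:
        raise ValueError(f'cannot find a known contig in {path!r}')
    return CONTIG_RANK[match.group(1)]

# A plain lexicographic sort would put chr10 before chr2; the key avoids that.
unsorted_paths = [f'/ukb_imp_chr{c}_v3.bgen' for c in ['10', '2', 'X', '1', 'MT', '22']]
sorted_paths = sorted(unsorted_paths, key=contig_sort_key)
print(sorted_paths)
```

The sorted list can then be passed wherever a list of BGEN paths is expected.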
The sample files do not need an index. It seems a bit odd to me that you have 22 sample files that are all the same. If you’re certain they’re all the same, just pick one and use that one. Here’s the code:
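import hail as hl

bucket = 'gs://my-bucket'  # placeholder: replace with the location of your UKB files
contigs = list(range(1, 23)) + ['X', 'Y', 'MT']
files = [f'{bucket}/ukb_imp_chr{contig}_v3.bgen' for contig in contigs]

# One-time step: writes an .idx2 index next to each BGEN file.
hl.index_bgen(files)

mt = hl.import_bgen(
    files,
    entry_fields=['GT'],
    sample_file=bucket + '/ukb48065_imp_chr22_v3_s487296.sample')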
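On the index_file_map question: as far as I understand it, that parameter is for putting the .idx2 files somewhere other than next to the BGEN files (for example, if the BGEN directory is read-only), not for pointing at .bgi files. A rough sketch of building such a map follows. `build_index_file_map` and the `/my_indices` directory are made up for illustration.

```python
import posixpath

def build_index_file_map(bgen_paths, index_dir):
    """Map each BGEN path to an index path under index_dir ending in '.idx2'."""
    return {
        path: posixpath.join(index_dir, posixpath.basename(path) + '.idx2')
        for path in bgen_paths
    }

bgen_paths = [f'/ukb_imp_chr{contig}_v3.bgen' for contig in range(1, 23)]
# '/my_indices' is a hypothetical writable directory.
index_file_map = build_index_file_map(bgen_paths, '/my_indices')
print(index_file_map['/ukb_imp_chr1_v3.bgen'])
```

The same dict would then need to be passed as index_file_map to both index_bgen and import_bgen so that the import looks for the indexes in the same place they were written.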
Thanks for the explanation and code snippet. This is great!
Re the sample files: Per UK Biobank Resource 664 (https://biobank.ctsu.ox.ac.uk/crystal/refer.cgi?id=664):
"""
Note that many of the Link files have the same contents for different anonymous files and hence only a single instance needs to be downloaded. Specifically:
The fam file is identical for all chromosomes and genotype data formats.
The sample file is identical for chromosomes 1-22 in the imputed data.
"""
So I literally downloaded the sample file for chr1 and copied it for chromosomes 1-22, and then downloaded the separate sample files for X and XY.
Does sample_file take a list of files as well? If not I may need to make a separate MatrixTable for X and XY, and then merge it in…
import_bgen expects a list of files that correspond to genomic chunks with the same samples, so it doesn’t make sense to have many sample files. If you were importing many files, each with a different sample file, you’d have to get back a list of MTs, not just one!
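One way to organize that is a separate import per distinct sample file, giving back one MatrixTable per group. This is a sketch only: the sample file names and paths below are placeholders, `import_groups` is a made-up helper, and Hail is imported inside the function so the grouping itself can be inspected without a Hail session.

```python
# One entry per distinct sample file; every BGEN path in a group shares its samples.
# All paths and sample file names here are hypothetical placeholders.
groups = {
    'autosomes.sample': [f'/ukb_imp_chr{c}_v3.bgen' for c in range(1, 23)],
    'chrX.sample': ['/ukb_imp_chrX_v3.bgen'],
    'chrXY.sample': ['/ukb_imp_chrXY_v3.bgen'],
}

def import_groups(groups, entry_fields=('GT',)):
    """Index and import each group, returning {sample_file: MatrixTable}."""
    import hail as hl  # lazy import: only needed when the function is called

    mts = {}
    for sample_file, paths in groups.items():
        hl.index_bgen(paths)  # one-time step per BGEN file
        mts[sample_file] = hl.import_bgen(
            paths,
            entry_fields=list(entry_fields),
            sample_file=sample_file)
    return mts
```

Calling `import_groups(groups)` would return three MatrixTables here. Whether and how they can be combined into one depends on how the sample sets differ, so that is left as a separate step.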