Generating index files vs. using pre-generated index for BGEN

I am trying to run a GWAS on UK Biobank data, as it seems many folks here are doing.

From UKB, I have a unique .bgen file for each chromosome, as well as .sample files (same for CHR 1-22), and unique .bgen.bgi index files for each chromosome.

To load CHR22, I tried:

mt = hl.import_bgen(bucket + '/ukb_imp_chr22_v3.bgen',
                    entry_fields=['GT'],
                    sample_file=bucket + '/ukb48065_imp_chr22_v3_s487296.sample')

and I get the error “FatalError: HailException: The following BGEN files have no .idx2 index file. Use ‘index_bgen’ to create the index file once before calling ‘import_bgen’”

Do I need to run index_bgen() for every chromosome before I make MatrixTables, or can I use the .bgi files I already have? I thought the “index_file_map” optional parameter might be helpful, but the documentation says the files must have an .idx2 extension. Does that mean it is not possible to use my index files, and I should just generate new ones?

Furthermore, if I am eventually to create a MatrixTable containing all contigs, can I use Hadoop glob patterns for the bgen and sample files in both index_bgen() and import_bgen()? Or do I need individual index_bgen() runs for each chromosome? Thanks!

Hi @acererak, sorry you’re running into trouble using Hail! I think we can quickly get you moving again.

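Yes, you need to run index_bgen for every chromosome, but index_bgen accepts a list of files, so a single call can cover all of them.

No, you cannot use the .bgi files: Hail uses its own index format that enables more powerful operations.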

You can use Hadoop glob patterns in both index_bgen and import_bgen. People often just construct a list of the files instead. This ensures they appear in the normal order: 1, 2, …, 22, X, Y, MT. If they’re not in the correct order, Hail has to (automatically) do a little bit of work to fix that.

The sample files do not need an index. It seems a bit odd to me that you have 22 sample files that are all the same. If you’re certain they’re all the same, just pick one and use that one. Here’s the code:

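import hail as hl
# `bucket` is the path prefix used in the question above
files = [f'{bucket}/ukb_imp_chr{contig}_v3.bgen' for contig in list(range(1, 23)) + ['X', 'Y', 'MT']]
# Build the .idx2 index files once; later imports reuse them
hl.index_bgen(files)
mt = hl.import_bgen(
    files,
    entry_fields=['GT'],
    sample_file=bucket + '/ukb48065_imp_chr22_v3_s487296.sample')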
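
If the paths come from a glob or a directory listing rather than a hand-built list, they will usually be in lexicographic order (1, 10, 11, …). A minimal sketch of putting them back into the contig order described above, assuming file names follow the `ukb_imp_chr<contig>_v3.bgen` pattern from this thread; the helper name `sort_by_contig` is made up for illustration:

```python
import re

# Contig order mentioned in the reply above: 1, 2, ..., 22, X, Y, MT.
CONTIG_ORDER = [str(i) for i in range(1, 23)] + ["X", "Y", "MT"]


def sort_by_contig(paths):
    """Return paths sorted by contig; unrecognised names go last."""
    def rank(path):
        # Assumes the contig sits between "_chr" and the next underscore.
        match = re.search(r"_chr([^_]+)_", path)
        contig = match.group(1) if match else None
        if contig in CONTIG_ORDER:
            return CONTIG_ORDER.index(contig)
        return len(CONTIG_ORDER)
    return sorted(paths, key=rank)
```

The sorted list can then be passed to index_bgen and import_bgen in place of `files`.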

Thanks for the explanation and code snippet. This is great!

Re the sample files: Per UK Biobank Resource 664 (https://biobank.ctsu.ox.ac.uk/crystal/refer.cgi?id=664):
"""
Note that many of the Link files have the same contents for different anonymous files and hence only a single instance needs to be downloaded. Specifically:

The fam file is identical for all chromosomes and genotype data formats.
The sample file is identical for chromosomes 1-22 in the imputed data.

"""
So I literally downloaded the sample file for chr1 and copied it for 1-22, and then just downloaded the sample files for X and XY.

Does sample_file take a list of files as well? If not I may need to make a separate MatrixTable for X and XY, and then merge it in…

Curious.

The sample_file parameter does not accept a list. When Neale Lab did the GWAS of all the phenotypes, they treated the autosomes, X, and Y each as separate regressions: https://github.com/Nealelab/UK_Biobank_GWAS/blob/8f8ee456fdd044ce6809bb7e7492dc98fd2df42f/0.2/run_regressions.biomarkers.py

Sounds good, I will likely do the same.

import_bgen expects a list of files that correspond to genomic chunks with the same samples, so it doesn’t make sense to have many sample files. If you were importing many files, each with a different sample file, you’d have to get back a list of MTs, not just one!
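
A rough sketch of what "one import per sample file" could look like: group the BGEN files by the sample file they share, then call import_bgen once per group. The bucket name `gs://my-bucket` and both helper names are placeholders, not anything from UK Biobank or Hail; only `hl.import_bgen` with `entry_fields` and `sample_file` is the real API.

```python
def bgen_paths(bucket, contigs):
    """Build BGEN paths following the naming pattern used in this thread."""
    return [f"{bucket}/ukb_imp_chr{c}_v3.bgen" for c in contigs]


def import_group(paths, sample_file):
    """Import one group of BGEN files that all share a single sample file."""
    import hail as hl  # imported lazily so bgen_paths works without Hail
    return hl.import_bgen(paths, entry_fields=["GT"], sample_file=sample_file)


# "gs://my-bucket" is a hypothetical prefix; substitute your own.
autosome_paths = bgen_paths("gs://my-bucket", range(1, 23))
x_paths = bgen_paths("gs://my-bucket", ["X"])

# Each group gets its own MatrixTable, using its own sample file, e.g.:
# autosomes_mt = import_group(autosome_paths, "<autosome .sample path>")
# x_mt = import_group(x_paths, "<chrX .sample path>")
```

Keeping the groups as separate MatrixTables matches the approach of running the autosomes and sex chromosomes as separate regressions.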

@danking If I run index_bgen() as you suggest, with the input as a list of files, can I also use glob or list for the contig_recoding?

I have 01 instead of 1 and 02 instead of 2, which throws a HailException: Invalid locus.

Ah, I see it takes a dict of str:str. That should work!
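
For anyone landing here later, a minimal sketch of such a dict, assuming the files use zero-padded names "01" through "22" for the autosomes (how the sex chromosomes are named is not covered here):

```python
# Map zero-padded contig names to unpadded ones.
# Two-digit names such as "10" map to themselves, which is harmless.
contig_recoding = {f"{i:02d}": str(i) for i in range(1, 23)}
```

The dict is passed once for the whole list of files, for example `hl.index_bgen(files, contig_recoding=contig_recoding)`, rather than once per file.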