Generating index files vs. using pre-generated index for BGEN

acererak · May 28, 2020, 9:06pm

I am trying to run a GWAS on UK Biobank data, as it seems many folks here are doing.

From UKB, I have a unique .bgen file for each chromosome, as well as .sample files (same for CHR 1-22), and unique .bgen.bgi index files for each chromosome.

If I am trying to load in CHR22, I tried:

mt = hl.import_bgen(bucket+'/ukb_imp_chr22_v3.bgen',
                    entry_fields=['GT'],
                    sample_file=bucket+'/ukb48065_imp_chr22_v3_s487296.sample')

and I get the error “FatalError: HailException: The following BGEN files have no .idx2 index file. Use ‘index_bgen’ to create the index file once before calling ‘import_bgen’”

Do I need to run index_bgen() for every chromosome before I make MatrixTables, or can I use the .bgi files I already have? I thought the “index_file_map” optional parameter might be helpful, but the documentation says the index files must have an .idx2 extension. Does that mean it is not possible to use my existing index files, and I should just generate new ones?

Furthermore, if I am eventually to create a MatrixTable containing all contigs, can I use Hadoop glob patterns for the bgen and sample files in both index_bgen() and import_bgen()? Or do I need individual index_bgen() runs for each chromosome? Thanks!

danking · May 29, 2020, 1:34pm

Hi @acererak, sorry you’re running into trouble using Hail! I think we can quickly get you moving again.

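Yes, you need to index every chromosome, but index_bgen accepts a list of files, so a single call covers them all.

No, you can’t use the .bgi files: Hail uses its own index format (.idx2) that enables more powerful operations.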

You can use Hadoop glob patterns in both index_bgen and import_bgen. People often just construct a list of the files instead, which ensures they appear in the normal order: 1, 2, …, 22, X, Y, MT. If they’re not in the correct order, Hail has to (automatically) do a little bit of work to fix that.

The sample files do not need an index. It seems a bit odd to me that you have 22 sample files that are all the same. If you’re certain they’re all the same, just pick one and use that one. Here’s the code:

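import hail as hl

# `bucket` is the path prefix from the original post (e.g. a gs:// bucket).
files = [f'{bucket}/ukb_imp_chr{contig}_v3.bgen'
         for contig in list(range(1, 23)) + ['X', 'Y', 'MT']]

# Build the .idx2 index files once; this only needs to run a single time.
hl.index_bgen(files)

# The keyword is `entry_fields`, matching the call in the original post.
mt = hl.import_bgen(
    files,
    entry_fields=['GT'],
    sample_file=bucket+'/ukb48065_imp_chr22_v3_s487296.sample')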
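
If the paths come from a glob rather than a hand-built list, they will usually arrive in lexicographic order (chr10 before chr2). A minimal sketch of putting them back into the normal contig order before passing them to Hail; the helper names `contig_of` and `sort_by_contig` are made up for illustration, and the regex assumes the `ukb_imp_chr{contig}_v3.bgen` naming used in this thread:

```python
import re

# The "normal" contig order described above: autosomes, then X, Y, MT.
CONTIG_ORDER = [str(i) for i in range(1, 23)] + ['X', 'Y', 'MT']


def contig_of(path):
    """Pull the contig label out of a name like ukb_imp_chr22_v3.bgen."""
    match = re.search(r'_chr([0-9]+|X|Y|MT)_', path)
    if match is None:
        raise ValueError(f'no recognised contig label in {path!r}')
    return match.group(1)


def sort_by_contig(paths):
    """Return the paths sorted by contig order instead of by string order."""
    return sorted(paths, key=lambda p: CONTIG_ORDER.index(contig_of(p)))
```

The sorted list can then be handed to index_bgen and import_bgen in place of `files`. Labels outside the list above (for example XY) raise a ValueError here rather than being placed silently.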

acererak · May 29, 2020, 3:56pm

Thanks for the explanation and code snippet. This is great!

Re the sample files: Per UK Biobank Resource 664 (https://biobank.ctsu.ox.ac.uk/crystal/refer.cgi?id=664):
"""
Note that many of the Link files have the same contents for different anonymous files and hence only a single instance needs to be downloaded. Specifically:

The fam file is identical for all chromosomes and genotype data formats.
The sample file is identical for chromosomes 1-22 in the imputed data.
"""
So I literally downloaded the sample file for chr1 and copied it for chromosomes 1-22, and then downloaded the separate sample files for X and XY.

Does sample_file take a list of files as well? If not I may need to make a separate MatrixTable for X and XY, and then merge it in…

danking · May 29, 2020, 4:06pm

Curious.

The sample_file parameter does not accept a list. When Neale Lab did the GWAS of all the phenotypes, they treated the autosomes, X, and Y each as separate regressions: https://github.com/Nealelab/UK_Biobank_GWAS/blob/8f8ee456fdd044ce6809bb7e7492dc98fd2df42f/0.2/run_regressions.biomarkers.py

acererak · May 29, 2020, 4:09pm

Sounds good, I will likely do the same.

tpoterba · May 29, 2020, 4:11pm

import_bgen expects a list of files that correspond to genomic chunks with the same samples, so it doesn’t make sense to have many sample files. If you were importing many files, each with a different sample file, you’d have to get back a list of MTs, not just one!
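
A rough sketch of what "one MatrixTable per sample file" could look like, assuming the BGEN files have already been indexed. The function names and the `autosome_sample` / `x_sample` arguments are placeholders for illustration, not names from UK Biobank or Hail; only `hl.import_bgen` with `entry_fields` and `sample_file` is taken from the thread:

```python
def bgen_groups(bucket, autosome_sample, x_sample):
    """Map a group name to (BGEN paths, sample file) for that group.

    Every file within a group must share the same samples.
    """
    autosomes = [f'{bucket}/ukb_imp_chr{c}_v3.bgen' for c in range(1, 23)]
    return {
        'autosomes': (autosomes, autosome_sample),
        'X': ([f'{bucket}/ukb_imp_chrX_v3.bgen'], x_sample),
    }


def import_groups(groups):
    """Import each group separately, returning one MatrixTable per group."""
    import hail as hl  # imported here so the helper above works without Hail

    return {
        name: hl.import_bgen(paths, entry_fields=['GT'], sample_file=sample)
        for name, (paths, sample) in groups.items()
    }
```

Each resulting MatrixTable can then be analysed on its own, in the same spirit as running the autosomes and sex chromosomes as separate regressions.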

acererak · May 29, 2020, 4:16pm

@danking If I run index_bgen() as you suggest, with the input as a list of files, can I also use glob or list for the contig_recoding?

My files label the contigs 01 instead of 1, 02 instead of 2, and so on, which throws “HailException: Invalid locus”.

acererak · May 29, 2020, 4:17pm

Ah, I see it takes a dict of str:str. That should work!
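
For anyone hitting the same error, a minimal sketch of such a dict, assuming the only mismatch is the zero-padded labels 01 through 09 described above. The wrapper function `index_with_recoding` is a made-up name; `contig_recoding` is the index_bgen parameter discussed in this thread:

```python
# Map zero-padded contig labels in the BGEN files ('01'..'09')
# to the names the reference genome uses ('1'..'9').
contig_recoding = {f'{i:02d}': str(i) for i in range(1, 10)}


def index_with_recoding(files, recoding):
    """Index a list of BGEN files, recoding contig names while indexing."""
    import hail as hl  # imported here so the dict above can be built without Hail

    hl.index_bgen(files, contig_recoding=recoding)
```

The dict is a plain str-to-str mapping, so one dict applies to every file in the list passed to index_bgen.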
