Hail for GEL .bgen import in UKB RAP

albbaous07 · May 26, 2025, 11:58am

Hi all,

I am trying to compute info scores with Hail on the UKB RAP for the GEL-imputed .bgen files, which are zstd-compressed. I am doing this in the HAIL-VEP feature of the UKB RAP. As a starting point for importing the BGEN files I am following this GitHub notebook, OpenBio/hail_tutorial/BGEN_import.ipynb (dnanexus/OpenBio on GitHub), and I have adapted its code as follows:

from pyspark.sql import SparkSession
import hail as hl

HAIL_DIR = "/opt/conda/lib/python3.11/site-packages/hail"
JAR_PATH = f"{HAIL_DIR}/backend/hail-all-spark.jar"

spark = (
    SparkSession.builder
    .config("spark.jars", JAR_PATH)
    .config("spark.driver.extraClassPath", JAR_PATH)
    .config("spark.executor.extraClassPath", "./hail-all-spark.jar")
    .enableHiveSupport()
    .getOrCreate()
)

hl.init(sc=spark.sparkContext)

import os

bgen_path = "/mnt/project/Bulk/Imputation/GEL"
filename = "ukb21008_c22_b0_v1"  # note: reassigned by the loop below

# Index every .bgen file found in the directory
for filename in os.listdir(bgen_path):
    if not filename.endswith(".bgen"):
        continue  # skip non-BGEN files

    file_url = f"file://{bgen_path}/{filename}"
    index_file = f"hdfs:///{filename}.idx2"

    hl.index_bgen(
        path=file_url,
        index_file_map={file_url: index_file},
        reference_genome="GRCh38",
        contig_recoding=None,
        skip_invalid_loci=False,
    )

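# Map each BGEN file to its index (skipping non-BGEN files)
index_file_map = {}
for filename in os.listdir(bgen_path):
    if filename.endswith(".bgen"):
        index_file_map[f"file://{bgen_path}/{filename}"] = f"hdfs:///{filename}.idx2"

# Import step (missing from the snippet as pasted; entry_fields=["GP"] is an assumption)
mt = hl.import_bgen(path=list(index_file_map), entry_fields=["GP"], index_file_map=index_file_map)

print(f"Num partitions: {mt.n_partitions()}")
mt.describe()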
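
Since the loop walks every file in the directory, a single chromosome could be isolated before any indexing starts. A minimal sketch, assuming the `_c22_` token in the file name identifies the chromosome (inferred from `ukb21008_c22_b0_v1`). The helper name `build_index_file_map` and the example file list are hypothetical, and this is plain Python that does not call Hail:

```python
def build_index_file_map(filenames, bgen_dir, index_root="hdfs:///", chromosome=None):
    """Map file:// BGEN URLs to .idx2 locations, skipping non-BGEN files.

    If `chromosome` is given, keep only files whose name contains
    the token "_c<chromosome>_" (an assumed naming convention).
    """
    index_file_map = {}
    for name in sorted(filenames):
        if not name.endswith(".bgen"):
            continue  # skip .sample, .bgi and any other companion files
        if chromosome is not None and f"_c{chromosome}_" not in name:
            continue  # not the requested chromosome
        bgen_url = f"file://{bgen_dir}/{name}"
        index_file_map[bgen_url] = f"{index_root}{name}.idx2"
    return index_file_map


# Hypothetical directory listing, used only to illustrate the filtering
example_files = [
    "ukb21008_c21_b0_v1.bgen",
    "ukb21008_c22_b0_v1.bgen",
    "ukb21008_c22_b0_v1.sample",
]
chr22_map = build_index_file_map(
    example_files, "/mnt/project/Bulk/Imputation/GEL", chromosome="22"
)
print(chr22_map)
```

The resulting dictionary has the same shape as the `index_file_map` argument used above, so in principle it could be passed to `hl.index_bgen` and `hl.import_bgen` for just that one file.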

However, my code seems to run forever. This is all the output I get:
Loading BokehJS …
pip-installed Hail requires additional configuration options in Spark referring
  to the path to the Hail Python module directory HAIL_DIR,
  e.g. /path/to/python/site-packages/hail:
    spark.jars=HAIL_DIR/backend/hail-all-spark.jar
    spark.driver.extraClassPath=HAIL_DIR/backend/hail-all-spark.jar
    spark.executor.extraClassPath=./hail-all-spark.jar
Running on Apache Spark version 3.5.2
SparkUI available at http://ip-10-60-38-192.eu-west-2.compute.internal:8084
Welcome to
```
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.132-678e1f52b999
```
LOGGING: writing to /opt/notebooks/hail-20250526-1147-0.2.132-678e1f52b999.log

The Spark UI isn't telling me much either. The chromosome 22 file is 20 GB, and I am running on a mem2_ssd1_v2_x8 instance. I just want to get info scores, allele frequencies, HWE and relatedness for my GWAS QC, but if it won't even import the BGEN in a timely manner for one of the smallest chromosomes, I am feeling a bit helpless.
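
To be concrete about the info score I am after: as far as I understand it, this is the IMPUTE-style measure, computed per variant from the genotype probabilities (GP). A minimal pure-Python sketch, assuming biallelic variants and GP triples ordered (hom-ref, het, hom-alt). The function name and the toy data are hypothetical, and this is not Hail's implementation:

```python
def allele_freq_and_info(gps):
    """Return (alt allele frequency, IMPUTE-style info score) for one variant.

    `gps` is a list of (p_homref, p_het, p_homalt) tuples, one per sample.
    """
    n = len(gps)
    if n == 0:
        raise ValueError("need at least one sample")
    # Expected alt-allele dosage e_i and expected squared dosage f_i per sample
    dosages = [p1 + 2.0 * p2 for _, p1, p2 in gps]
    second_moments = [p1 + 4.0 * p2 for _, p1, p2 in gps]
    theta = sum(dosages) / (2.0 * n)  # estimated alt allele frequency
    if theta <= 0.0 or theta >= 1.0:
        return theta, 1.0  # monomorphic: info is defined as 1
    # Sum of per-sample genotype variances, relative to the binomial expectation
    var_sum = sum(f - e * e for e, f in zip(dosages, second_moments))
    info = 1.0 - var_sum / (2.0 * n * theta * (1.0 - theta))
    return theta, info


# Toy data: perfectly called genotypes vs. completely uninformative ones
certain = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 0.0)]
uncertain = [(0.25, 0.5, 0.25)] * 4
print(allele_freq_and_info(certain))    # (0.5, 1.0)
print(allele_freq_and_info(uncertain))  # (0.5, 0.0)
```

In Hail, I would expect `hl.agg.info_score(mt.GP)` inside `mt.annotate_rows(...)` to give the equivalent per-variant value once the import works, though I have not been able to check that on this dataset.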
