Hail for GEL .bgen import in UKB RAP

Hi all,

I am trying to compute info scores with Hail on the UKB RAP for the GEL imputed .bgen files, which are zstd-compressed. I am doing this in the HAIL-VEP feature of the UKB RAP. To begin with, I am trying to follow this GitHub notebook on importing BGEN files: OpenBio/hail_tutorial/BGEN_import.ipynb (master branch of dnanexus/OpenBio on GitHub). I have adapted it into the following code:

from pyspark.sql import SparkSession
import hail as hl

HAIL_DIR = "/opt/conda/lib/python3.11/site-packages/hail"
JAR_PATH = f"{HAIL_DIR}/backend/hail-all-spark.jar"

spark = (
    SparkSession.builder
    .config("spark.jars", JAR_PATH)
    .config("spark.driver.extraClassPath", JAR_PATH)
    .config("spark.executor.extraClassPath", "./hail-all-spark.jar")
    .enableHiveSupport()
    .getOrCreate()
)

hl.init(sc=spark.sparkContext)

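import os

bgen_path = "/mnt/project/Bulk/Imputation/GEL"
filename = "ukb21008_c22_b0_v1"  # chromosome 22 (note: the loop below reuses this name)

# Index every BGEN file found in the directory
for filename in os.listdir(bgen_path):
    if not filename.endswith(".bgen"):
        continue  # skip non-BGEN files

    file_url = f"file://{bgen_path}/{filename}"
    index_file = f"hdfs:///{filename}.idx2"

    hl.index_bgen(
        path=file_url,
        index_file_map={file_url: index_file},
        reference_genome="GRCh38",
        contig_recoding=None,
        skip_invalid_loci=False
    )

# Map each BGEN file to its index, again skipping non-BGEN files
index_file_map = {}
for filename in os.listdir(bgen_path):
    if filename.endswith(".bgen"):
        index_file_map[f"file://{bgen_path}/{filename}"] = f"hdfs:///{filename}.idx2"

# Import the indexed BGEN files; entry_fields here is an assumption
# (GP is what info scores need)
mt = hl.import_bgen(
    path=list(index_file_map),
    entry_fields=["GT", "GP"],
    index_file_map=index_file_map
)

print(f"Num partitions: {mt.n_partitions()}")
mt.describe()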
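For reference, here is a minimal sketch of building the index map so that only .bgen files (and optionally a single chromosome) end up in it. The helper name `build_index_file_map`, the `chrom` argument, and the example names other than `ukb21008_c22_b0_v1` are made up for illustration, and matching on `_c<N>_` assumes all files follow the naming pattern of that one file.

```python
import re


def build_index_file_map(filenames, bgen_dir, index_root="hdfs:///", chrom=None):
    """Map each BGEN file URL to the location of its Hail index (.idx2).

    Hypothetical helper: `chrom` filters on the `_c<N>_` part of the file name,
    which is an assumption based on the name `ukb21008_c22_b0_v1`.
    """
    index_file_map = {}
    for name in sorted(filenames):
        if not name.endswith(".bgen"):
            continue  # skip .sample files and anything else that is not a BGEN
        if chrom is not None and not re.search(rf"_c{chrom}_", name):
            continue  # keep only the requested chromosome
        index_file_map[f"file://{bgen_dir}/{name}"] = f"{index_root}{name}.idx2"
    return index_file_map


# Example listing; in the notebook this would come from os.listdir(bgen_path).
files = [
    "ukb21008_c21_b0_v1.bgen",    # hypothetical neighbour file
    "ukb21008_c22_b0_v1.bgen",
    "ukb21008_c22_b0_v1.sample",  # hypothetical non-BGEN file
]
chr22_map = build_index_file_map(files, "/mnt/project/Bulk/Imputation/GEL", chrom=22)
print(chr22_map)
```

Restricting the map to one chromosome like this would also make it easier to tell whether indexing a single 20 GB file is slow, or whether the loop is simply working through every file in the directory.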

The indexing step seems to run forever, though. The only output I get is:
Loading BokehJS …
pip-installed Hail requires additional configuration options in Spark referring
  to the path to the Hail Python module directory HAIL_DIR,
  e.g. /path/to/python/site-packages/hail:
    spark.jars=HAIL_DIR/backend/hail-all-spark.jar
    spark.driver.extraClassPath=HAIL_DIR/backend/hail-all-spark.jar
    spark.executor.extraClassPath=./hail-all-spark.jar
Running on Apache Spark version 3.5.2
SparkUI available at http://ip-10-60-38-192.eu-west-2.compute.internal:8084
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.132-678e1f52b999
LOGGING: writing to /opt/notebooks/hail-20250526-1147-0.2.132-678e1f52b999.log

The Spark UI isn't telling me much either. The chromosome 22 file is 20 GB, and I am running this on a mem2_ssd1_v2_x8 instance. I just want to get info scores, allele frequencies, HWE and relatedness for my GWAS QC, but if it won't even import the .bgen in a timely manner for the smallest chromosome, I am feeling a bit helpless.
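To make clear what I mean by info score, here is a small pure-Python sketch of the IMPUTE-style measure for a single variant, computed from per-sample genotype probabilities (the GP field). The function name and the toy inputs are made up for illustration, and returning 1.0 for a monomorphic variant is a convention assumed here. As far as I understand, Hail's `hl.agg.info_score` aggregator computes the same kind of quantity from `GP`, so this is only meant to pin down the definition, not to replace it.

```python
def info_score(gp_rows):
    """IMPUTE-style info score for one biallelic variant.

    gp_rows: iterable of (p_hom_ref, p_het, p_hom_alt) tuples, one per sample.
    Returns None when there are no samples.
    """
    n = 0
    sum_dosage = 0.0  # sum of expected alt-allele dosages e_i
    sum_var = 0.0     # sum of per-sample dosage variances f_i - e_i^2
    for _p0, p1, p2 in gp_rows:
        e = p1 + 2.0 * p2  # expected dosage
        f = p1 + 4.0 * p2  # expected squared dosage
        sum_dosage += e
        sum_var += f - e * e
        n += 1
    if n == 0:
        return None
    theta = sum_dosage / (2.0 * n)  # estimated alt-allele frequency
    if theta <= 0.0 or theta >= 1.0:
        return 1.0  # monomorphic variant (convention assumed in this sketch)
    return 1.0 - sum_var / (2.0 * n * theta * (1.0 - theta))


# Toy data: fully certain genotypes versus completely uninformative ones.
certain = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 0.0)]
uninformative = [(0.25, 0.5, 0.25), (0.25, 0.5, 0.25)]
print(info_score(certain), info_score(uninformative))
```

The intermediate `theta` is the expected allele frequency, so the allele frequency I want for QC falls out of the same pass over the data.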