Hi all,
I am trying to compute info scores using Hail on the UKB RAP for the GEL imputed .bgen files, which are zstd-compressed. I am doing this in the HAIL-VEP feature of the UKB RAP. To begin with, I am following this GitHub notebook on importing BGEN files, OpenBio/hail_tutorial/BGEN_import.ipynb at master · dnanexus/OpenBio · GitHub, and I have written the following code adapted from it:
from pyspark.sql import SparkSession
import hail as hl

HAIL_DIR = "/opt/conda/lib/python3.11/site-packages/hail"
JAR_PATH = f"{HAIL_DIR}/backend/hail-all-spark.jar"
spark = (
    SparkSession.builder
    .config("spark.jars", JAR_PATH)
    .config("spark.driver.extraClassPath", JAR_PATH)
    .config("spark.executor.extraClassPath", "./hail-all-spark.jar")
    .enableHiveSupport()
    .getOrCreate()
)
hl.init(sc=spark.sparkContext)
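import os

bgen_path = "/mnt/project/Bulk/Imputation/GEL"
filename = "ukb21008_c22_b0_v1"  # overwritten by the loop below, which visits every file in the directory
# Index each BGEN file, writing its .idx2 index to HDFS
for filename in os.listdir(bgen_path):
    if not filename.endswith(".bgen"):
        continue  # skip non-BGEN files
    file_url = f"file://{bgen_path}/{filename}"
    index_file = f"hdfs:///{filename}.idx2"
    hl.index_bgen(
        path=file_url,
        index_file_map={file_url: index_file},
        reference_genome="GRCh38",
        contig_recoding=None,
        skip_invalid_loci=False
    )
# Map each BGEN file to the index written above
index_file_map = {}
for filename in os.listdir(bgen_path):
    if not filename.endswith(".bgen"):
        continue  # keep only BGEN files in the map
    index_file_map[f"file://{bgen_path}/{filename}"] = f"hdfs:///{filename}.idx2"
# mt must be defined before it is used; import call as in the linked tutorial
# (the entry_fields chosen here are an assumption)
mt = hl.import_bgen(
    path=list(index_file_map),
    entry_fields=["GT", "GP"],
    index_file_map=index_file_map
)
print(f"Num partitions: {mt.n_partitions()}")
mt.describe()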
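One thing I have not ruled out is that the loops over os.listdir pick up every file in the directory rather than only chromosome 22. A minimal sketch of building the index map for a single chromosome, assuming the files follow the ukb21008_c<chrom>_b0_v1.bgen naming seen in my filename; the helper name build_index_map and the example listing are hypothetical:

```python
def build_index_map(filenames, bgen_dir, chrom, index_root="hdfs:///"):
    """Map each BGEN URL for one chromosome to its .idx2 location.

    Assumes names like ukb21008_c22_b0_v1.bgen (chromosome after "_c").
    """
    tag = f"_c{chrom}_"
    index_map = {}
    for name in sorted(filenames):
        # Skip companion files (.sample, .bgen.bgi) and other chromosomes.
        if not name.endswith(".bgen") or tag not in name:
            continue
        index_map[f"file://{bgen_dir}/{name}"] = f"{index_root}{name}.idx2"
    return index_map


# Hypothetical directory listing, standing in for os.listdir(bgen_path).
listing = [
    "ukb21008_c21_b0_v1.bgen",
    "ukb21008_c22_b0_v1.bgen",
    "ukb21008_c22_b0_v1.bgen.bgi",
    "ukb21008_c22_b0_v1.sample",
]
chr22_map = build_index_map(listing, "/mnt/project/Bulk/Imputation/GEL", 22)
print(chr22_map)
```

A dict built this way could be passed as index_file_map to both hl.index_bgen and hl.import_bgen, so that only the chromosome 22 file is indexed and imported.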
When I run my code, though, it hangs indefinitely. This is the output:
Loading BokehJS …
pip-installed Hail requires additional configuration options in Spark referring
  to the path to the Hail Python module directory HAIL_DIR,
  e.g. /path/to/python/site-packages/hail:
    spark.jars=HAIL_DIR/backend/hail-all-spark.jar
    spark.driver.extraClassPath=HAIL_DIR/backend/hail-all-spark.jar
    spark.executor.extraClassPath=./hail-all-spark.jar
Running on Apache Spark version 3.5.2
SparkUI available at http://ip-10-60-38-192.eu-west-2.compute.internal:8084
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.132-678e1f52b999
LOGGING: writing to /opt/notebooks/hail-20250526-1147-0.2.132-678e1f52b999.log
The Spark UI isn't telling me much either. The chromosome 22 file is 20 GB, and I am running on the mem2_ssd1_v2_x8 instance type. I just want to get info scores, allele frequencies, HWE, and relatedness for my GWAS QC, but if it won't even import the BGEN in a timely manner for the smallest chromosome, I am feeling a bit helpless.
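To be explicit about what I mean by info score: as far as I understand it, this is the IMPUTE-style information metric computed per variant from the genotype probabilities (GP). Below is a minimal plain-Python sketch of that definition for one biallelic variant, using made-up probabilities. The function name impute_info_score is hypothetical, and this is only an illustration of the formula, not Hail's implementation:

```python
def impute_info_score(gps):
    """IMPUTE-style info score for one biallelic variant.

    gps: list of (p_hom_ref, p_het, p_hom_alt) tuples, one per sample.
    Returns (info, alt_allele_frequency).
    """
    n = len(gps)
    if n == 0:
        raise ValueError("need at least one sample")
    dosage_sum = 0.0
    variance_sum = 0.0
    for _, p_het, p_hom_alt in gps:
        e = p_het + 2.0 * p_hom_alt  # expected alt-allele dosage
        f = p_het + 4.0 * p_hom_alt  # expected squared dosage
        dosage_sum += e
        variance_sum += f - e * e    # per-sample dosage variance
    theta = dosage_sum / (2.0 * n)   # estimated alt-allele frequency
    if theta <= 0.0 or theta >= 1.0:
        return 1.0, theta            # monomorphic: score defined as 1
    info = 1.0 - variance_sum / (2.0 * n * theta * (1.0 - theta))
    return info, theta


# Made-up example data: fully certain calls vs. completely uninformative ones.
certain = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 1, 0)]
uncertain = [(0.25, 0.5, 0.25)] * 4
print(impute_info_score(certain))
print(impute_info_score(uncertain))
```

With hard-call probabilities the score comes out as 1, and with probabilities equal to the Hardy-Weinberg expectation at the same allele frequency it comes out as 0. In Hail itself, I believe the corresponding calls are hl.agg.info_score(mt.GP) for the info score and hl.variant_qc(mt) for allele frequencies and the HWE p-value, but I cannot get far enough to try them.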