UK Biobank DRAGEN WGS BGEN files use 16-bit probabilities that are incompatible with Hail

WGS DRAGEN BGEN files were recently released on UKBB.

Using Hail 2.4.1 on a Spark cluster, I generated Hail index files with the following Python command:

hl.index_bgen(path=file_url,
              index_file_map={file_url: f"hdfs:///{filename}.idx2"},
              reference_genome="GRCh38",
              skip_invalid_loci=False)

After the index creation completed successfully, I loaded the BGEN files into a Hail MatrixTable using import_bgen:

mt = hl.import_bgen(path=bgen_file_map,
                    entry_fields=['GT'],
                    sample_file=f"file://{bgen_path}/ukb24309_c1_b0_v1.sample",
                    n_partitions=None,
                    block_size=None,
                    index_file_map=index_file_map,
                    variants=annotation_ht)

Finally, I computed an aggregate score using hl.agg.sum over all samples and a subset of variants.

When I tried to run this computation and write out a dataframe, I received the following error:

FatalError: HailException: Hail only supports 8-bit probabilities, found 16.

This occurs whether or not I load in the ‘GP’ entry field in Hail. It also occurs on Hail 2.3.1. Reviewing the Hail source code and documentation, it appears this is a fundamental limitation of Hail, which means the WGS DRAGEN BGEN files may not be usable by Hail.
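For anyone who wants to confirm the bit depth of a file before handing it to Hail, here is a minimal sketch that reads the bits-per-probability byte of the first variant directly from the file. It is based on my reading of the BGEN v1.2 layout and assumes layout 2 with zlib or no compression; zstd-compressed files would need a third-party decompressor. The function name first_variant_bit_depth is hypothetical, and the sketch only inspects the first variant, so it is not a full validation of the file.

```python
import struct
import zlib


def first_variant_bit_depth(path):
    """Return B (bits per probability) for the first variant of a BGEN file.

    Assumption: BGEN v1.2, layout 2, zlib-compressed or uncompressed.
    """
    with open(path, "rb") as f:
        # First 4 bytes: offset of the first variant block, relative to byte 4.
        (offset,) = struct.unpack("<I", f.read(4))
        # Header block: length, variant count, sample count, magic number.
        header_len, _n_variants, _n_samples = struct.unpack("<III", f.read(12))
        if f.read(4) not in (b"bgen", b"\x00\x00\x00\x00"):
            raise ValueError("not a BGEN file")
        # The flags are the last 4 bytes of the header block.
        f.seek(4 + header_len - 4)
        (flags,) = struct.unpack("<I", f.read(4))
        compression = flags & 0b11
        layout = (flags >> 2) & 0b1111
        if layout != 2:
            raise ValueError(f"sketch only handles layout 2, found {layout}")
        if compression not in (0, 1):
            raise ValueError("sketch only handles zlib or no compression")

        f.seek(offset + 4)
        # Skip variant id, rsid and chromosome (2-byte length prefix each).
        for _ in range(3):
            (length,) = struct.unpack("<H", f.read(2))
            f.seek(length, 1)
        f.seek(4, 1)  # variant position
        (n_alleles,) = struct.unpack("<H", f.read(2))
        for _ in range(n_alleles):
            (length,) = struct.unpack("<I", f.read(4))
            f.seek(length, 1)

        # Genotype data block: total length C, then (if compressed) length D.
        (c,) = struct.unpack("<I", f.read(4))
        if compression == 1:
            f.seek(4, 1)  # skip D, the uncompressed length
            data = zlib.decompress(f.read(c - 4))
        else:
            data = f.read(c)

    # Decompressed block: N (4), K (2), Pmin (1), Pmax (1), N ploidy bytes,
    # phased flag (1), then B (1).
    (n,) = struct.unpack_from("<I", data, 0)
    return data[8 + n + 1]
```

Calling first_variant_bit_depth on a local copy of one of the files should return 16 if the file triggers the error above, and 8 for files that Hail can read.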

I began a conversion to 8-bit BGEN files with qctool, but it is very slow and not feasible to run across all of the UKBB WGS data.
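To make the 16-bit versus 8-bit distinction concrete: as I understand the BGEN v1.2 format, each probability is stored as an unsigned B-bit integer x representing x / (2^B - 1), so a down-conversion has to rescale and re-round every stored value. The sketch below illustrates that arithmetic for a single sample using largest-remainder rounding, so the results still sum to full scale. The requantize function is hypothetical and is not necessarily how qctool performs the conversion; it also takes all of a sample's probabilities, whereas the file itself omits the last one.

```python
def requantize(values, from_bits=16, to_bits=8):
    """Rescale one sample's stored probabilities to a lower bit depth.

    `values` holds the integer encodings of ALL genotype probabilities for
    one sample, so they must sum to 2**from_bits - 1.
    """
    old_max = (1 << from_bits) - 1
    new_max = (1 << to_bits) - 1
    if sum(values) != old_max:
        raise ValueError("probabilities must sum to full scale")

    # Exact rescaling in integer arithmetic: v * new_max / old_max.
    scaled = [v * new_max for v in values]
    result = [s // old_max for s in scaled]
    remainders = [s % old_max for s in scaled]

    # Flooring loses a few units; hand them to the largest remainders.
    shortfall = new_max - sum(result)
    order = sorted(range(len(values)), key=lambda i: remainders[i], reverse=True)
    for i in order[:shortfall]:
        result[i] += 1
    return result
```

For example, requantize([32768, 32767, 0]) gives [128, 127, 0]. The precision lost is at most 1/255 per probability, which is unlikely to matter when only hard calls (GT) are needed.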

Has anyone else observed this issue and found a workaround?

Thanks for letting us know about these new datasets. I’ve created a tracking issue so that we can add this support.

Thanks! Does this mean support for 16-bit BGEN files?

Eventually yes. I can’t give you a firm timeline at this point.