WGS DRAGEN BGEN files were recently released on UKBB.
Using Hail 2.4.1 on a Spark cluster, I generated Hail index files with the following Python command:
hl.index_bgen(path=file_url,
              index_file_map={file_url: f"hdfs:///{filename}.idx2"},
              reference_genome="GRCh38",
              skip_invalid_loci=False)
After the index creation completed successfully, I loaded the BGEN files into a Hail MatrixTable using import_bgen:
mt = hl.import_bgen(path=bgen_file_map,
                    entry_fields=['GT'],
                    sample_file=f"file://{bgen_path}/ukb24309_c1_b0_v1.sample",
                    n_partitions=None,
                    block_size=None,
                    index_file_map=index_file_map,
                    variants=annotation_ht)
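When there are many BGEN files, the index_file_map passed to both calls could be built in one step. The following is a minimal sketch, not the exact code used here: the helper name build_index_file_map and the example URL are hypothetical, and it assumes that {filename} in the index path is the base name of the BGEN file. Hail expects each index path to end in .idx2.

```python
import posixpath


def build_index_file_map(bgen_urls, index_root="hdfs:///"):
    """Map each BGEN URL to an index path ending in .idx2 under index_root.

    Hypothetical helper for illustration; adjust index_root to your cluster.
    """
    # Make sure the root ends with a slash without touching the URL scheme.
    root = index_root if index_root.endswith("/") else index_root + "/"
    index_file_map = {}
    for url in bgen_urls:
        # Assumption: the index is named after the BGEN file's base name.
        filename = posixpath.basename(url)
        index_file_map[url] = f"{root}{filename}.idx2"
    return index_file_map


# Placeholder URL, not a real UKBB path.
example_map = build_index_file_map(["file:///data/example_c1_b0.bgen"])
print(example_map)
```

The same dictionary can then be given as index_file_map to hl.index_bgen and hl.import_bgen, so both calls agree on where each index lives.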
Finally, I computed an aggregate score using hl.agg.sum over all samples and a subset of variants.
When I ran this computation and wrote out a dataframe, I received the following error:
FatalError: HailException: Hail only supports 8-bit probabilities, found 16.
This occurs whether or not I load the 'GP' entry field in Hail. It also occurs on Hail 2.3.1. From the Hail source code and documentation, this appears to be a fundamental limitation of Hail, which means the WGS DRAGEN BGEN files may not be usable with Hail.
I began converting the files to 8-bit BGENs with qctool, but it is very slow and not feasible to run across all the UKBB WGS data.
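A per-file conversion of this kind might look like the sketch below. It assumes QCTOOL v2 and its -g, -s, -og and -bgen-bits options; the helper name qctool_8bit_command and all file names are hypothetical placeholders. The code only builds and prints the command, so the options can be checked against the installed qctool version before anything is run.

```python
import shlex


def qctool_8bit_command(bgen_in, sample_in, bgen_out):
    """Build a qctool command that rewrites a BGEN file with 8-bit probabilities.

    Hypothetical helper; assumes QCTOOL v2 option names.
    """
    return [
        "qctool",
        "-g", bgen_in,       # input genotype file
        "-s", sample_in,     # matching sample file
        "-og", bgen_out,     # output BGEN file
        "-bgen-bits", "8",   # bits per stored probability in the output
    ]


# Placeholder file names, not real UKBB paths.
cmd = qctool_8bit_command("in.bgen", "in.sample", "out_8bit.bgen")
print(shlex.join(cmd))
```

The returned list can be passed directly to subprocess.run(cmd, check=True) once the command looks right.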
Has anyone else observed this issue and found a workaround?