Efficient merging of bgen shards - export bgen

Does the Hail team have recommendations on efficiently merging shards from export_bgen(…, parallel="header_per_shard")? I was transforming bgen to pgen, then merging using plink2, but the output was not being accepted by regenie.

Hey @jwillett !

Hmm. I wouldn’t use parallel="header_per_shard", I’d use parallel='separate_header' and then cat the pieces together (perhaps with a 100-way tree of cats if there are too many files). The header_per_shard mode is really meant for when you’re going to run some code on each partition separately.
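A minimal sketch of the "tree of cats" idea, assuming the shards have already been copied to local disk and that `parts` lists the header file first, followed by the part files in order. The helper names `cat_files` and `tree_cat` and the temporary file names are hypothetical, not part of Hail:

```python
import shutil
from pathlib import Path


def cat_files(parts, out_path):
    """Concatenate the files in `parts`, in order, into `out_path`."""
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


def tree_cat(parts, out_path, fanout=100, tmp_dir="."):
    """Concatenate `parts` in order, using at most `fanout` inputs per step."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    parts = [Path(p) for p in parts]
    temps = []
    level = 0
    # Repeatedly merge groups of `fanout` files until one step is enough.
    while len(parts) > fanout:
        merged = []
        for i in range(0, len(parts), fanout):
            tmp = Path(tmp_dir) / f"tree_cat.{level}.{i // fanout}.tmp"
            cat_files(parts[i:i + fanout], tmp)
            merged.append(tmp)
            temps.append(tmp)
        parts = merged
        level += 1
    cat_files(parts, out_path)
    # Clean up the intermediate files.
    for tmp in temps:
        tmp.unlink()
```

For a handful of files a single cat is enough; the tree only matters when the number of inputs per step has to be capped.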

Is there a way to remove headers on produced shards to enable the use of cat? I had to move my projects forward, using header_per_shard, given that I will be doing GWAS (so running code on each separate partition is acceptable). I would prefer to combine all the shards into a single file before running GWAS to keep things organized.

The BGEN Format is precisely defined so this should be possible.

According to the BGEN format, the first four bytes are an unsigned little-endian 32-bit integer giving the offset of the start of the variant data (relative to the end of this 32-bit integer). For each file you can get that offset:

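import struct

import hailtop.fs as hfs

# The first four bytes hold a little-endian uint32: the offset of the
# variant data, measured from the end of this integer.
with hfs.open(path_to_bgen, 'rb') as f:
    x = f.read(4)
offset = struct.unpack('<I', x)[0]
print(offset + 4)  # total number of bytes before the variant data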

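Then you can use tail -c +N to skip the header and the sample block. Note that tail -c +N starts output at byte N, counting from 1, so N must be one more than the value printed above (that is, offset + 5).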
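The same idea can be sketched in plain Python for shards copied to local disk. This is only a sketch under a few assumptions: every shard has the same samples and header flags, each shard's header records that shard's own variant count, and the first shard's header and sample block are reused for the merged file. Per the BGEN layout, the variant count is the 32-bit integer at bytes 8 to 11 (after the 4-byte offset and the 4-byte header length), so the sketch rewrites it with the total. The function names are hypothetical:

```python
import shutil
import struct


def variant_data_start(path):
    """Return the byte position at which variant data begins."""
    with open(path, "rb") as f:
        (offset,) = struct.unpack("<I", f.read(4))
    return offset + 4


def variant_count(path):
    """Return the variant count stored in the BGEN header block."""
    with open(path, "rb") as f:
        f.seek(8)  # skip the 4-byte offset and the 4-byte header length
        (count,) = struct.unpack("<I", f.read(4))
    return count


def merge_shards(shards, out_path):
    """Keep the first shard whole, append only variant data from the rest."""
    total = 0
    with open(out_path, "wb") as out:
        for i, shard in enumerate(shards):
            total += variant_count(shard)
            with open(shard, "rb") as src:
                if i > 0:
                    # Skip this shard's header and sample block.
                    src.seek(variant_data_start(shard))
                shutil.copyfileobj(src, out)
    # Patch the variant count in the merged header (assumption: downstream
    # tools read this field and expect the total).
    with open(out_path, "r+b") as out:
        out.seek(8)
        out.write(struct.pack("<I", total))
```

Any index file built for the individual shards would not describe the merged file, so it would need to be regenerated.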

Great, thanks! That will save on file count a lot.