Reading BGEN file using Hail on a Spark cluster results in corrupted matrix table

I have a BGEN file that I obtained from a PGEN file via PLINK2. The command I used is as follows:

plink2 --pfile my_file --export bgen-1.2 bits=8 --out /some/path --output-chr chrM

I then used hl.index_bgen() to create an idx2 index. With the BGEN file and the idx2 directory on a local file system and Hail running locally, the following works without any issues:

import hail as hl

mt = hl.import_bgen("/my/file.bgen", entry_fields=["GT"], sample_file="/my/file.bgen.sample")
# the describe() output looks good to me

But when I upload all files to S3 and use Hail backed by a Spark cluster running on AWS EMR, I can still get the matrix table and call mt.describe(). But I then get what looks like garbled, half-binary output (see the attached screenshot), and my Python interpreter hangs. I tried re-uploading the files to S3 in case they had been corrupted, but that didn’t help.

Would any of you have an idea what could be going on here?

Something is going wrong with reading the contig string at some variant. I can see a buried error in there about the contig not being found, because the contig is a string containing multiple full binary records from the BGEN.

Do you know if it’s layout 1 or layout 2?

Thanks, @tpoterba! Well spotted re the buried error.
To be honest, I don’t know whether it’s layout 1 or 2, or how I would check. From the link you sent me, it seems that layout 2 is the more modern one, and I assume it is the one PLINK picks by default when outputting BGEN v1.2. I did run hexdump -C on the BGEN file, and the first couple of bytes look as follows:

Not quite what I would have expected after reading the first few paragraphs of the specs. After that, it seems like a long list of sample IDs.
Is using hexdump or just head -c <some small number> the way you would check for the layout version?
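For reference, the layout does not have to be guessed from a hexdump: it is encoded in the flags word at the end of the BGEN header block. Below is a minimal sketch, assuming the header layout described in the BGEN specification (a 4-byte offset, then header length, variant count, sample count, magic bytes, optional free data, and finally the flags, all little-endian). The function names parse_bgen_header and read_bgen_header are made up for this example and are not part of Hail or PLINK.

```python
import struct


def parse_bgen_header(data: bytes) -> dict:
    """Decode the leading fields of a BGEN file from its first bytes."""
    if len(data) < 24:
        raise ValueError("need at least 24 bytes to decode a BGEN header")
    # Bytes 0-3: offset to the first variant block.
    # Bytes 4-15: header length, number of variants, number of samples.
    offset, header_len, n_variants, n_samples = struct.unpack_from("<4I", data, 0)
    magic = data[16:20]
    if magic not in (b"bgen", b"\x00\x00\x00\x00"):
        raise ValueError(f"unexpected magic bytes: {magic!r}")
    if header_len < 20 or len(data) < 4 + header_len:
        raise ValueError("header is truncated or its length field is invalid")
    # The header block starts at byte 4 and is header_len bytes long;
    # the flags word is its last 4 bytes, i.e. it starts at byte header_len.
    (flags,) = struct.unpack_from("<I", data, header_len)
    return {
        "offset": offset,
        "n_variants": n_variants,
        "n_samples": n_samples,
        "compression": flags & 0b11,      # 0 = none, 1 = zlib, 2 = zstd
        "layout": (flags >> 2) & 0b1111,  # 1 = layout 1, 2 = layout 2
        "has_sample_ids": bool(flags >> 31),
    }


def read_bgen_header(path) -> dict:
    """Read just enough of the file at `path` to decode its header."""
    with open(path, "rb") as f:
        prefix = f.read(8)  # offset word plus header-length word
        if len(prefix) < 8:
            raise ValueError("file is too short to be a BGEN file")
        (header_len,) = struct.unpack_from("<I", prefix, 4)
        return parse_bgen_header(prefix + f.read(header_len - 4))
```

Calling read_bgen_header("/my/file.bgen")["layout"] would then report the layout. If has_sample_ids is true, a sample identifier block follows the header, which would match the long list of sample IDs seen in the dump.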

Based on your invocation above, you created a 1.2 file. See “Standard data input” in the PLINK 2.0 documentation.

Is that the correct BGEN format for use with Hail? I guess I could try v1.3, too.

1.3 and 1.2 should both work.

Do you have a checksum for both the local file and the remote file?

Is the BGEN in S3?

Can you share the exact script that succeeds locally and the exact script that fails on EMR?

If you use an S3 mount point to mount the file into an EC2 VM, are you able to inspect the file using other BGEN tools?

And is it possible to slice out a tiny number of variants that still fails that you can share with us? It’s hard to make forward progress on this without a replicable example of failure.

No, I don’t. But I mounted the remote file using s3fs and ran head -c <integer representation of 1e9> <file> | sha256sum, and the same with tail, on both the remote and the local file, and the sha256s are identical. Just for good measure, I also calculated the full sha256 of the <file>.idx2/index files, both remote and local, and it turns out that those sha256s are different :exploding_head: I then copied the local .idx2 directory over to S3, and now the above test code seems to work with both local and remote files.
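For anyone wanting to repeat this check without typing out the sha256sum calls by hand, here is a minimal sketch that hashes every file under two directories and reports the ones that differ. It assumes the remote copy is reachable as an ordinary path (for example through an s3fs mount, as above). The function names are made up for this example.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_tree(root):
    """Map each file under `root` (by relative path) to its SHA-256."""
    root = Path(root)
    if root.is_file():
        return {root.name: sha256_of_file(root)}
    return {
        p.relative_to(root).as_posix(): sha256_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_trees(local_root, remote_root):
    """List relative paths whose checksum differs or that exist on one side only."""
    local = checksum_tree(local_root)
    remote = checksum_tree(remote_root)
    return sorted(
        name
        for name in local.keys() | remote.keys()
        if local.get(name) != remote.get(name)
    )


# Example call (both paths are placeholders for a local and a mounted copy):
# print(diff_trees("/my/file.bgen.idx2", "/mnt/s3/my/file.bgen.idx2"))
```

An empty list means the two copies match byte for byte. Running it on the .idx2 directory as well as on the BGEN file itself would have flagged the mismatched index straight away.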

Thanks for pointing me towards calculating checksums! I only thought about the BGEN file being corrupted, which it wasn’t, but the index hadn’t come to my mind.
I’ll try to repeat my actual calculation and in case I run into any further similar troubles, I’ll let you know, and if not, I’ll mark this as answered.
Thanks again! :slightly_smiling_face:
