Why shouldn't `MatrixTable.write` be used with BGEN files?

01011 · January 30, 2023, 11:23am

I read the following in this thread:

Never use data directly from an import_*. Import the data, write to Hail native format, then read the Hail native format and use that.

So I thought this always applies to all import_* functions. But elsewhere on the forum I read this:

In particular, if your data is already stored in a BGEN file, you should never use MatrixTable.write.

I’m curious why this is the case. Is this a feature of recent versions of Hail? Does this hold true (i.e. never use write for BGEN) even if I’m working with older versions of Hail?

danking · January 30, 2023, 9:21pm

You’re right that the advice in the first thread is overly strong.

You should not use data directly from an import unless the source format is good. At the time of writing there are two “good” formats: Hail Matrix Table and partitioned BGEN.
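To make the rule concrete, here is a rough sketch of how it might look in code. The helper names (`needs_native_copy`, `load`), the suffix-based check, and the file paths are illustrative assumptions, not part of Hail. The Hail calls used are `hl.import_vcf`, `hl.import_bgen`, `hl.read_matrix_table`, and `MatrixTable.write`. It also assumes the BGEN has already been indexed once with `hl.index_bgen`.

```python
from pathlib import Path

# Formats treated as "good" to use directly: Hail Matrix Table and BGEN.
# Deciding by file suffix is a simplification made for this sketch.
GOOD_SUFFIXES = {".mt", ".bgen"}


def needs_native_copy(path: str) -> bool:
    """Return True if the source should be written to a Matrix Table once
    and read back from there, rather than used directly from the import."""
    return Path(path).suffix.lower() not in GOOD_SUFFIXES


def load(path: str, native_path: str):
    """Hypothetical loader following the rule above (assumes VCF otherwise)."""
    import hail as hl  # lazy import so needs_native_copy works without Hail

    suffix = Path(path).suffix.lower()
    if suffix == ".mt":
        return hl.read_matrix_table(path)
    if suffix == ".bgen":
        # Use the BGEN directly; do not write it to a Matrix Table.
        # Assumes hl.index_bgen(path) was run once beforehand.
        return hl.import_bgen(path, entry_fields=["GT"])
    # Text formats such as VCF: parse once, write native, read native.
    hl.import_vcf(path).write(native_path, overwrite=True)
    return hl.read_matrix_table(native_path)
```

For example, `needs_native_copy("calls.vcf.bgz")` is `True`, while `needs_native_copy("ukb_chr1.bgen")` and `needs_native_copy("cohort.mt")` are both `False`.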

BGEN is binary and compact: it stores each genotype in, I think, two bytes. Writing that to a Hail Matrix Table would store the genotype in, I think, 4 bytes. That doubles your data size without making it any faster to read. In contrast, parsing a VCF (which is a text file) is an expensive process. Doing that repeatedly is wasteful, so for VCF it pays to import once, write to a Hail Matrix Table, and read the Matrix Table from then on.
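As a back-of-the-envelope illustration of the size argument, the snippet below multiplies out the per-genotype figures from the reply (2 and 4 bytes, both stated there as approximate). The dataset dimensions are hypothetical, and the calculation ignores compression, indexes, and other per-row or per-column fields.

```python
def genotype_bytes(n_variants: int, n_samples: int, bytes_per_genotype: int) -> int:
    """Raw genotype storage: one fixed-size entry per (variant, sample) pair."""
    return n_variants * n_samples * bytes_per_genotype


# Hypothetical dataset dimensions, chosen only for illustration.
n_variants, n_samples = 1_000_000, 100_000

bgen_bytes = genotype_bytes(n_variants, n_samples, 2)  # ~2 bytes per genotype in BGEN
mt_bytes = genotype_bytes(n_variants, n_samples, 4)    # ~4 bytes per genotype in a Matrix Table

print(f"BGEN: {bgen_bytes / 1e9:.0f} GB, "
      f"Matrix Table: {mt_bytes / 1e9:.0f} GB, "
      f"ratio {mt_bytes / bgen_bytes:.1f}x")
# prints: BGEN: 200 GB, Matrix Table: 400 GB, ratio 2.0x
```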

This applies to all versions of Hail.
