Why shouldn't `MatrixTable.write` be used with BGEN files?

I read the following in this thread:

> Never use data directly from an import_*. Import the data, write to Hail native format, then read the Hail native format and use that.

So I assumed this applies to all import_* functions. But elsewhere on the forum I read this:

> In particular, if your data is already stored in a BGEN file, you should never use MatrixTable.write.

I’m curious why this is the case. Is this a feature of recent versions of Hail? Does this hold true (i.e. never use write for BGEN) even if I’m working with older versions of Hail?


You’re right that the first thread is overly strong.

You should not use data directly from an import unless the source format is one that is already efficient to read. At the time of writing there are two such “good” formats: Hail Matrix Table and partitioned BGEN.

BGEN is binary and compact: it stores each genotype in, I think, two bytes. Writing that to a Hail Matrix Table would store each genotype in, I think, four bytes. That doubles your data size. A VCF, by contrast, is a text file, and parsing it is expensive, so doing that repeatedly is wasteful. That is the case where you should import once, write to Hail native format, and read that back.
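
To make the size argument concrete, here is a back-of-the-envelope calculation. The cohort dimensions are hypothetical, and the per-genotype byte counts are just my rough guesses from above, not measured values; compression and metadata are ignored.

```python
# Hypothetical cohort size, chosen only for illustration.
n_samples = 500_000
n_variants = 1_000_000

# Rough per-genotype estimates from the answer above (assumptions, not measurements).
BGEN_BYTES_PER_GENOTYPE = 2
MT_BYTES_PER_GENOTYPE = 4

n_genotypes = n_samples * n_variants
bgen_bytes = n_genotypes * BGEN_BYTES_PER_GENOTYPE
mt_bytes = n_genotypes * MT_BYTES_PER_GENOTYPE

print(f"BGEN:         {bgen_bytes / 1e12:.1f} TB")
print(f"Matrix Table: {mt_bytes / 1e12:.1f} TB")
print(f"Ratio:        {mt_bytes / bgen_bytes:.1f}x")
```

Under those assumptions, writing the BGEN out as a Matrix Table would leave you storing a second copy that is about twice the size of the original.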

This applies to all versions of Hail.
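
As a rough sketch of the rule, something like the following would work. The helper names, the `native_path` argument, and the choice to decide by file extension are all hypothetical and only for illustration. It also assumes anything that is not `.mt` or `.bgen` is a VCF that Hail can import as-is, and that the BGEN file was already indexed once with `hl.index_bgen`.

```python
from pathlib import Path

# Formats treated as "good" in the answer above: read them directly every time.
DIRECT_FORMATS = {".mt", ".bgen"}


def needs_native_copy(path: str) -> bool:
    """Return True if the file should be imported once and written to native format."""
    return Path(path).suffix.lower() not in DIRECT_FORMATS


def load(path: str, native_path: str):
    """Load a dataset following the rule above.

    `native_path` is a hypothetical location for the one-off native copy.
    """
    import hail as hl  # imported lazily so needs_native_copy works without Hail

    suffix = Path(path).suffix.lower()
    if suffix == ".mt":
        return hl.read_matrix_table(path)
    if suffix == ".bgen":
        # Assumes the index already exists (created once with hl.index_bgen(path)).
        # Do not write the result out as a Matrix Table.
        return hl.import_bgen(path, entry_fields=["GT"])
    # Assumed to be a VCF: pay the parsing cost once, then reuse the native copy.
    hl.import_vcf(path).write(native_path, overwrite=True)
    return hl.read_matrix_table(native_path)
```

In a real pipeline you would run the VCF import and write once as a separate step, and have every later analysis start from `hl.read_matrix_table`.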
