I’m working on a project that requires us to understand what the MatrixTable format looks like on disk. Is there any documentation on the architecture, or are there code pointers for this?
Can you share a bit more about why this is a requirement? There’s nothing proprietary about the MatrixTable encoding, but it’s not designed in a way that makes it easy to build a reader separate from our software stack. Hail files have programmable encodings, and we hope to add new and better encodings soon, so even if you got a reader working today, it might be broken in 3 months!
The potential use case would be to encode the same information in a set of Parquet files or similar, so that the information in the .mt table could be added to a data science data catalog.
even if you got a reader working today, it might be broken in 3 months
Does that mean that a Hail .mt file wouldn’t be usable in some new version?
And what would be a programmable encoding?
Hail native files are parameterized at several levels: blocking and compression at the stream level; transformations at the buffer level that change how primitives are written (for instance, LEB128 encoding); and typed values, which are parameterized as “ETypes” (encoded types) that programmatically define encoder/decoder methods: hail/EType.scala at d12321dbefc79450e0e749c4bc9dc1b90fcdcb23 · hail-is/hail · GitHub
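To give a feel for what a buffer-level primitive transformation like LEB128 does, here is a minimal Python sketch of generic unsigned LEB128 encoding and decoding. It illustrates the general technique only; the function names are made up for this example, and it is not a description of Hail's exact on-disk byte layout.

```python
from typing import Tuple


def encode_uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128 (7 data bits per byte)."""
    if value < 0:
        raise ValueError("unsigned LEB128 only encodes non-negative integers")
    out = bytearray()
    while True:
        byte = value & 0x7F          # take the low 7 bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)


def decode_uleb128(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode one unsigned LEB128 value; return (value, position after it)."""
    result = 0
    shift = 0
    pos = offset
    while True:
        if pos >= len(data):
            raise ValueError("truncated LEB128 value")
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7
```

Small integers take a single byte and larger ones grow as needed, which is why a variable-length scheme like this can be attractive for primitives such as lengths and counts.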
We’ll be undertaking a project to add more and better ETypes soon, so the current version of Hail won’t be able to read files written by future versions (no forward compatibility). However, newer versions of Hail will continue to be able to read older files.
It sounds like the right approach here might be to use Hail to export some other format, rather than reading Hail native files directly.
Got it. Exporting is probably the way forward.
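For reference, a rough sketch of what such an export to Parquet could look like from Python, assuming Hail is running on its Spark backend. The function name and both paths are placeholders for illustration; `hl.read_matrix_table`, `MatrixTable.entries()`, and `Table.to_spark()` are existing Hail calls, but whether a long-format entries table is the right shape for a given catalog is an assumption.

```python
def export_entries_to_parquet(mt_path: str, out_path: str) -> None:
    """Write a MatrixTable's entries to Parquet, one record per (row, column) pair.

    Hypothetical helper: mt_path and out_path are placeholders.
    """
    import hail as hl  # imported lazily so the module loads without Hail installed

    mt = hl.read_matrix_table(mt_path)

    # entries() gives a Table in "long" form that carries the row, column,
    # and entry fields together. This can be very large for big datasets.
    entries = mt.entries()

    # to_spark() converts the Hail Table to a Spark DataFrame; by default it
    # flattens nested structs into top-level columns.
    df = entries.to_spark()
    df.write.mode("overwrite").parquet(out_path)
```

If only the variant-level or sample-level annotations are needed in the catalog, `mt.rows()` or `mt.cols()` could be exported the same way and would be much smaller than the full entries table.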
If there were a need to read the data back again, is there a set of interfaces that could be implemented so that the data structures and information that Hail needs can be supplied by something other than a
Sure, Hail already supports importing from VCF, BGEN, PLINK, Avro, and TSV files. At its core, MT is just another structured data format.
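As one possible way to go in the other direction, here is a hedged sketch that loads long-format Parquet data through Spark and reshapes it into a MatrixTable. The function name and the key column names are hypothetical; it assumes the Parquet data has one record per (row key, column key) pair and that Hail is using the Spark backend. `hl.Table.from_spark` and `Table.to_matrix_table` are existing Hail calls.

```python
from typing import List


def parquet_to_matrix_table(spark, parquet_path: str,
                            row_key: List[str], col_key: List[str]):
    """Build a MatrixTable from long-format Parquet data (hypothetical helper).

    spark        -- an active SparkSession shared with Hail
    parquet_path -- placeholder path to the Parquet dataset
    row_key      -- column names identifying a row, e.g. ["variant_id"] (assumed)
    col_key      -- column names identifying a column, e.g. ["sample_id"] (assumed)
    """
    import hail as hl  # imported lazily so the module loads without Hail installed

    df = spark.read.parquet(parquet_path)
    ht = hl.Table.from_spark(df)

    # Fields not named as keys become entry fields unless they are listed
    # explicitly via the row_fields / col_fields arguments.
    return ht.to_matrix_table(row_key=row_key, col_key=col_key)
```

Hail-specific types such as loci would likely need to be rebuilt from plain columns after loading, so a round trip through Parquet is not guaranteed to reproduce the original schema exactly.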