I’m wondering how exactly Hail partitions data. Does each Spark partition contain the data for a portion of the entries (a portion of the markers/variants and a portion of the samples)? Let’s say we have a MatrixTable with 40M SNPs and 1000 samples. Could it happen that each Spark partition contains portions of the data for 10k SNPs and 100 samples, or is it guaranteed that each partition will contain all samples?
And a follow-up: whatever the answer is, has that changed in recent years, or has it always been this way since 2015?
Hail Tables and MatrixTables are only partitioned along the row axis. Tables and MTs created from VCFs always have variants for rows, so each Hail partition contains all of the samples for a contiguous interval of rows. These intervals are always non-overlapping.
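To make that concrete, here is a toy sketch in plain Python (not Hail internals) of the partitioning scheme described above: rows are split into contiguous, non-overlapping intervals, and every partition carries all of the samples for its interval. The function name and interval layout are illustrative assumptions, not Hail's actual code.

```python
# Toy sketch: partition n_rows variants into contiguous, non-overlapping
# row intervals; each partition holds entries for ALL samples in its interval.

def partition_rows(n_rows, n_partitions):
    """Split [0, n_rows) into contiguous, non-overlapping half-open intervals."""
    base, extra = divmod(n_rows, n_partitions)
    intervals, start = [], 0
    for i in range(n_partitions):
        size = base + (1 if i < extra else 0)
        intervals.append((start, start + size))
        start += size
    return intervals

n_variants, n_samples = 40_000_000, 1000
parts = partition_rows(n_variants, 8)

# Intervals tile the row axis: contiguous, non-overlapping, covering all rows.
assert parts[0][0] == 0 and parts[-1][1] == n_variants
assert all(a_end == b_start for (_, a_end), (b_start, _) in zip(parts, parts[1:]))

# Every partition's entry block spans the full sample dimension.
part_shapes = [(end - start, n_samples) for start, end in parts]
assert all(cols == n_samples for _, cols in part_shapes)
```

So in the 40M x 1000 example above, a partition might hold rows [0, 5M) with all 1000 samples, but never a 10k x 100 sub-block.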
This has not changed since 2015; however, we’re actively designing a “blocked” matrix table because we anticipate that memory requirements will make rows with 2M samples impractical.
Why do you ask?
hey dan! thanks. that makes sense.
I’m asking because I’m giving a presentation about a past project I worked on at 23andMe, but I realized I don’t remember some basic things about Hail, like how it partitions a MatrixTable. It’s been over three years.
But now I have another question, related to the partitioning of the columns (sample) Table. I understand that if the MT is partitioned by rows (given some key), and the row (variant) Table also has that key, it makes sense that the variant Table would also be partitioned and distributed across the executors by that key. That would make any joins between the variant Table and the MT efficient.
The columns (sample) Table, however, is also distributed by design. So we have a situation where a single partition is guaranteed to hold entries for all samples over some genomic region, while the sample Table is spread across many partitions. That would require some shuffling and make joins between the columns Table and the MT less efficient. Is that a correct description?
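To make the question concrete, here's a toy sketch in plain Python (not Hail/Spark API) of the two join patterns I have in mind. All names and the dictionary-based layout are made up for illustration.

```python
# Variant-keyed join: the MT partition and the variant-Table partition cover
# the same key range, so the join is partition-local -- no data movement.
mt_part = {("chr1", 100): "entries...", ("chr1", 200): "entries..."}
variant_ann_part = {("chr1", 100): {"gene": "BRCA1"},
                    ("chr1", 200): {"gene": "BRCA2"}}
row_joined = {k: (v, variant_ann_part[k])
              for k, v in mt_part.items() if k in variant_ann_part}

# Sample-keyed join: every MT partition holds entries for ALL samples, so
# each one needs the whole sample Table. If that Table lives in many
# partitions, either it gets shuffled/collected, or (since it is usually
# small) a full copy is shipped to every executor.
sample_ann = {"s1": {"pop": "EUR"}, "s2": {"pop": "AFR"}}
copy_on_each_executor = dict(sample_ann)  # one full copy per partition
```

My question is whether Hail actually pays that shuffle cost for column joins, or avoids it some other way.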