I’m wondering how exactly Hail partitions data. Does each Spark partition contain the data for a portion of the entries (a portion of the markers/variants and a portion of the samples)? Let’s say we have a MatrixTable with 40M SNPs and 1000 samples. Could it happen that each Spark partition contains portions of the data for 10k SNPs and 100 samples, or is it guaranteed that each partition will contain all samples?
And a follow-up: whatever the answer is, has that changed in recent years, or has it always been this way since 2015?
Hail Tables and MatrixTables are only partitioned along the row axis. Tables and MTs created from VCFs always have variants for rows, so each Hail partition contains all of the samples for a contiguous interval of rows. These intervals are always non-overlapping.
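To make that concrete, here is a toy sketch in plain Python (not Hail internals) of the partitioning scheme described above: rows are split into contiguous, non-overlapping intervals, and every partition carries all of the samples for its interval. The function name and interval layout are illustrative assumptions, not Hail's actual code.

```python
# Toy sketch: partition n_rows variants into contiguous, non-overlapping
# row intervals; each partition holds entries for ALL samples in its interval.

def partition_rows(n_rows, n_partitions):
    """Split [0, n_rows) into contiguous, non-overlapping half-open intervals."""
    base, extra = divmod(n_rows, n_partitions)
    intervals, start = [], 0
    for i in range(n_partitions):
        size = base + (1 if i < extra else 0)
        intervals.append((start, start + size))
        start += size
    return intervals

n_variants, n_samples = 40_000_000, 1000
parts = partition_rows(n_variants, 8)

# Intervals tile the row axis: contiguous, non-overlapping, covering all rows.
assert parts[0][0] == 0 and parts[-1][1] == n_variants
assert all(a_end == b_start for (_, a_end), (b_start, _) in zip(parts, parts[1:]))

# Every partition's entry block spans the full sample dimension.
part_shapes = [(end - start, n_samples) for start, end in parts]
assert all(cols == n_samples for _, cols in part_shapes)
```

So in the 40M x 1000 example above, a partition might hold rows [0, 5M) with all 1000 samples, but never a 10k x 100 sub-block.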
This has not changed since 2015; however, we’re actively designing a “blocked” matrix table because we anticipate that memory requirements will make rows with 2M samples impractical.
Why do you ask?
hey dan! thanks. that makes sense.
I’m asking because I’m giving a presentation about a past project I worked on at 23andMe, but I realized I don’t remember some basic things about Hail, like how it partitions a MatrixTable. It’s been over three years.
But now I have another question, related to the partitioning of the columns (sample) Table. I understand that if the MT is partitioned by rows (given some key), and the row (variant) Table also has that key, it makes sense that the variant Table would also be partitioned and distributed across the executors by that key. That would make any joins between the variant Table and the MT efficient.
The columns (sample) Table, however, is also distributed by design. So we have a situation where a single partition is guaranteed to hold entries for all samples over some genomic region, while the sample Table is spread across many partitions. That would require some shuffling and make joins between the columns Table and the MT less efficient. Is that a correct description?
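To make the question concrete, here's a toy sketch in plain Python (not Hail/Spark API) of the two join patterns I have in mind. All names and the dictionary-based layout are made up for illustration.

```python
# Variant-keyed join: the MT partition and the variant-Table partition cover
# the same key range, so the join is partition-local -- no data movement.
mt_part = {("chr1", 100): "entries...", ("chr1", 200): "entries..."}
variant_ann_part = {("chr1", 100): {"gene": "BRCA1"},
                    ("chr1", 200): {"gene": "BRCA2"}}
row_joined = {k: (v, variant_ann_part[k])
              for k, v in mt_part.items() if k in variant_ann_part}

# Sample-keyed join: every MT partition holds entries for ALL samples, so
# each one needs the whole sample Table. If that Table lives in many
# partitions, either it gets shuffled/collected, or (since it is usually
# small) a full copy is shipped to every executor.
sample_ann = {"s1": {"pop": "EUR"}, "s2": {"pop": "AFR"}}
copy_on_each_executor = dict(sample_ann)  # one full copy per partition
```

My question is whether Hail actually pays that shuffle cost for column joins, or avoids it some other way.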