For one, you’re melting a two-dimensional structure, where the row (resp. column) key/metadata is stored once per row (resp. column), into a one-dimensional table where it is replicated per entry. I see that Tim is also replying so I’ll stop there for now.
A MatrixTable is the union of the row fields, column fields, and entry fields.
Imagine we have the following row (variant) fields:
locus
alleles
info (big struct)
And the following column (sample) fields:
sample ID
phenos (big struct)
The MatrixTable representation stores the row fields and the column fields only once, unifying them with the entry data on the fly.
The .entries() table is the fully exploded representation, meaning that the variant fields are duplicated per sample, and the sample fields are duplicated per variant.
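To make the explode concrete, here is a toy sketch in plain Python (the field names like locus, sample_id, and GT are just illustrative, and this is a conceptual model of what .entries() does, not Hail's actual implementation):

```python
# Two "variants" (rows), two "samples" (columns), and a 2x2 grid of entries.
rows = [{"locus": "1:100", "info": {"AF": 0.01}},
        {"locus": "1:200", "info": {"AF": 0.20}}]
cols = [{"sample_id": "S1", "phenos": {"height": 170}},
        {"sample_id": "S2", "phenos": {"height": 165}}]
entries = [[{"GT": "0/0"}, {"GT": "0/1"}],
           [{"GT": "1/1"}, {"GT": "0/0"}]]

# .entries()-style explode: every output record carries a full copy of its
# row fields AND its column fields alongside the entry fields.
exploded = [{**r, **c, **e}
            for r, row_entries in zip(rows, entries)
            for c, e in zip(cols, row_entries)]

# len(exploded) == n_variants * n_samples; each row's `info` appears once
# per sample, and each column's `phenos` once per variant.
```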
If we have 10K samples and 10M variants, then my info fields are duplicated 10K times, and my phenos data is duplicated 10M times!
So if I just choose a small subset of info and a small subset of phenos, then it shouldn’t be that bad. But it could go to 100x or more if there’s just a ton of data there. Did I get that right?
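Roughly, yes. A back-of-envelope calculation with the thread's 10M-variant x 10K-sample numbers (the per-struct byte sizes below are invented for illustration) shows why the blowup is driven entirely by how many bytes of row/column data each entry has to carry:

```python
# Numbers from the thread; byte sizes are assumptions for illustration only.
n_variants = 10_000_000
n_samples = 10_000

info_bytes = 500      # assumed size of each variant's info struct
phenos_bytes = 1_000  # assumed size of each sample's phenos struct

# MatrixTable: row and column fields are stored once each.
matrixtable_cost = n_variants * info_bytes + n_samples * phenos_bytes

# entries(): every one of the n_variants * n_samples records repeats both.
entries_cost = n_variants * n_samples * (info_bytes + phenos_bytes)

blowup = entries_cost / matrixtable_cost
# Selecting a small subset of fields shrinks (info_bytes + phenos_bytes),
# and the blowup scales down linearly with it.
```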
Is there a good way to get as Spark output just (row, col, entry) records, where the row and col are indices (e.g., Int64) as opposed to something like a string, which is much more costly to repeat? And in conjunction, have a way to map those indices into the non-repeated row/column tables?
You can definitely do that, yeah. Three tables for genotypes, rows, cols, keyed by the index. But even though this might be efficient on disk, now everything is formulated as a join, which will be inefficient from a compute standpoint.
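A minimal sketch of that three-table layout in plain Python (field names are illustrative; in practice these would be three Spark/Hail tables keyed by the index columns):

```python
# Row and column metadata stored once, keyed by integer index.
row_table = {0: {"locus": "1:100"}, 1: {"locus": "1:200"}}
col_table = {0: {"sample_id": "S1"}, 1: {"sample_id": "S2"}}

# Entries carry only the two Int64 indices plus the entry fields.
entry_table = [
    {"row_idx": 0, "col_idx": 1, "GT": "0/1"},
    {"row_idx": 1, "col_idx": 0, "GT": "1/1"},
]

# Reassembling a full record is now a double join on the indices,
# which is the compute cost mentioned above.
joined = [{**row_table[e["row_idx"]], **col_table[e["col_idx"]], "GT": e["GT"]}
          for e in entry_table]
```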