MatrixTable entries() and to_spark()

Hi there,

I’m playing around with a large MatrixTable and trying to export it to an Apache Spark DataFrame for further processing.

Trying to do something like this:
import hail as hl  # mt below is the large hl.MatrixTable in question
df = mt.entries().to_spark()

I’ve seen the comment on MatrixTable.entries() about the size explosion, but it’s not entirely clear to me why that is the case.

Can someone shed light on why the data in this case is so large (compared, for example, to the original data used to build the MatrixTable)?

Hi Ofer,

For one, you’re melting a two-dimensional structure, in which the row (resp. column) key and metadata are stored once per row (resp. column), into a one-dimensional table in which they are replicated per entry. I see that Tim is also replying, so I’ll stop there for now. :slight_smile:

Jon

A MatrixTable is the union of the row fields, column fields, and entry fields.

Imagine we have the following row (variant) fields:

  • locus
  • alleles
  • info (big struct)

And the following column (sample) fields:

  • sample ID
  • phenos (big struct)

The MatrixTable representation stores the row fields and the column fields only once, unifying them with the entry data on the fly.

The .entries() table is the fully exploded representation, meaning that the variant fields are duplicated per sample, and the sample fields are duplicated per variant.

If we have 10K samples and 10M variants, then the info fields are duplicated 10K times (once per sample), and the phenos data is duplicated 10M times (once per variant)!
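
To make the blow-up concrete, here is a minimal sketch on a toy simulated dataset (hl.balding_nichols_model is just a convenient way to conjure a small MatrixTable; the dimensions are made up):

    import hail as hl

    # Toy MatrixTable: 5 variants (rows) x 4 samples (cols).
    mt = hl.balding_nichols_model(n_populations=3, n_samples=4, n_variants=5)

    # The matrix representation stores the 5 sets of row fields and the
    # 4 sets of col fields once each, alongside the 5 * 4 = 20 entries.
    # entries() melts this into a 20-row table in which every row carries
    # a full copy of both its variant fields and its sample fields.
    et = mt.entries()
    print(et.count())  # 20 == 5 variants * 4 samples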

Thanks Tim/Jon,

That makes sense and is very helpful!

So if I choose just a small subset of info and a small subset of phenos, it shouldn’t be that bad, but it could blow up 100x or more if there’s a ton of data in those structs. Did I get that right?
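
For concreteness, I’m picturing something like this sketch (info.AF, phenos.height, and GT are hypothetical field names standing in for whatever subset is actually needed):

    # Sketch: trim the row, col, and entry fields before exploding.
    # Continuing with the mt from above; field names are hypothetical.
    mt_small = (mt
        .select_rows(af=mt.info.AF)            # one row field (row key is kept automatically)
        .select_cols(height=mt.phenos.height)  # one col field (col key is kept automatically)
        .select_entries(mt.GT))                # one entry field
    df = mt_small.entries().to_spark()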

Is there a good way to get, as the Spark output, just (row, col, entry) tuples, where row and col are indices (e.g. Int64) rather than something like a string, which is much more costly to repeat? And, in conjunction, a way to map those indices back to the non-repeated row/column tables?

You can definitely do that, yeah: three tables, for genotypes, rows, and cols, keyed by the index. But even though this might be efficient on disk, everything downstream is now formulated as a join, which will be inefficient from a compute standpoint.
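
A minimal sketch of that layout (row_idx and col_idx are just names chosen here; note that select_rows/select_cols always retain the key fields, so the entries table still carries the row and col keys alongside the indices):

    # Attach integer indices to rows and columns.
    mt = mt.add_row_index('row_idx').add_col_index('col_idx')

    # Row and col metadata, stored once per variant / once per sample.
    rows_df = mt.rows().to_spark()
    cols_df = mt.cols().to_spark()

    # Entries keyed by the two indices instead of the full metadata.
    entries_df = (mt
        .select_rows('row_idx')
        .select_cols('col_idx')
        .entries()
        .to_spark())

    # Recovering the full melted table later means joining entries_df
    # back to rows_df and cols_df on row_idx / col_idx.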