For one, you’re melting a two-dimensional structure, where the row (resp. column) key/metadata is stored once per row (resp. column), into a one-dimensional table where it is replicated per entry. I see that Tim is also replying so I’ll stop there for now.
A MatrixTable is the union of the row fields, column fields, and entry fields.
Imagine we have the following row (variant) fields:
locus
alleles
info (big struct)
And the following column (sample) fields:
sample ID
phenos (big struct)
The MatrixTable representation stores the row fields and the column fields only once, unifying them with the entry data on the fly.
The .entries() table is the fully exploded representation, meaning that the variant fields are duplicated per sample, and the sample fields are duplicated per variant.
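To make the explode concrete, here is a toy sketch in plain Python (the field names like locus, sample_id, and GT are just illustrative, and this is a conceptual model of what .entries() does, not Hail's actual implementation):

```python
# Two "variants" (rows), two "samples" (columns), and a 2x2 grid of entries.
rows = [{"locus": "1:100", "info": {"AF": 0.01}},
        {"locus": "1:200", "info": {"AF": 0.20}}]
cols = [{"sample_id": "S1", "phenos": {"height": 170}},
        {"sample_id": "S2", "phenos": {"height": 165}}]
entries = [[{"GT": "0/0"}, {"GT": "0/1"}],
           [{"GT": "1/1"}, {"GT": "0/0"}]]

# .entries()-style explode: every output record carries a full copy of its
# row fields AND its column fields alongside the entry fields.
exploded = [{**r, **c, **e}
            for r, row_entries in zip(rows, entries)
            for c, e in zip(cols, row_entries)]

# len(exploded) == n_variants * n_samples; each row's `info` appears once
# per sample, and each column's `phenos` once per variant.
```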
If we have 10K samples and 10M variants, then my info fields are duplicated 10K times, and my phenos data is duplicated 10M times!
So if I just choose a small subset of info and a small subset of phenos, then it shouldn’t be that bad. But it could go to 100x or more if there’s just a ton of data there. Did I get that right?
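Roughly, yes. A back-of-envelope calculation with the thread's 10M-variant x 10K-sample numbers (the per-struct byte sizes below are invented for illustration) shows why the blowup is driven entirely by how many bytes of row/column data each entry has to carry:

```python
# Numbers from the thread; byte sizes are assumptions for illustration only.
n_variants = 10_000_000
n_samples = 10_000

info_bytes = 500      # assumed size of each variant's info struct
phenos_bytes = 1_000  # assumed size of each sample's phenos struct

# MatrixTable: row and column fields are stored once each.
matrixtable_cost = n_variants * info_bytes + n_samples * phenos_bytes

# entries(): every one of the n_variants * n_samples records repeats both.
entries_cost = n_variants * n_samples * (info_bytes + phenos_bytes)

blowup = entries_cost / matrixtable_cost
# Selecting a small subset of fields shrinks (info_bytes + phenos_bytes),
# and the blowup scales down linearly with it.
```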
Is there a good way to get as Spark output just (row, col, entry) records, where the row and col are indices (e.g., Int64) as opposed to something like a string, which is much more costly to repeat? And in conjunction, have a way to map those indices into the non-repeated row/column tables?
You can definitely do that, yeah. Three tables for genotypes, rows, cols, keyed by the index. But even though this might be efficient on disk, now everything is formulated as a join, which will be inefficient from a compute standpoint.
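A minimal sketch of that three-table layout in plain Python (field names are illustrative; in practice these would be three Spark/Hail tables keyed by the index columns):

```python
# Row and column metadata stored once, keyed by integer index.
row_table = {0: {"locus": "1:100"}, 1: {"locus": "1:200"}}
col_table = {0: {"sample_id": "S1"}, 1: {"sample_id": "S2"}}

# Entries carry only the two Int64 indices plus the entry fields.
entry_table = [
    {"row_idx": 0, "col_idx": 1, "GT": "0/1"},
    {"row_idx": 1, "col_idx": 0, "GT": "1/1"},
]

# Reassembling a full record is now a double join on the indices,
# which is the compute cost mentioned above.
joined = [{**row_table[e["row_idx"]], **col_table[e["col_idx"]], "GT": e["GT"]}
          for e in entry_table]
```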