Hail suitability for trans-eQTL

Hi Everyone,
We’re looking at solutions for storing and querying the results of a trans-eQTL data set.
That’s 20k genes x ~5M variants = 1e11 tests (100 billion rows x 8 columns).
We know that a Hail MatrixTable or Table would work here.
But I’m interested in what kind of execution time we could expect for a query applied to all rows, something like p-value < 1e-6.
Has anyone tried this, and do you have any runtime results you could share?

If I understand correctly, you’re asking about analyzing the results of this dataset. You have already produced it.

What kind of queries do you want to execute? My instinct is to store this as a variant-by-gene matrix. That’s 5M rows by 20,000 columns. By Hail standards, that’s a fairly small matrix. You’ll be able to filter to a set of rows quickly but you cannot quickly filter columns.
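
To make the matrix layout concrete, here is a minimal sketch of the p-value query against a variant-by-gene MatrixTable. The path argument and the entry field name `p_value` are assumptions for illustration; substitute whatever your dataset actually uses. Hail is imported inside the functions only so the snippet loads on a machine without Hail installed.

```python
def count_significant(mt_path, threshold=1e-6):
    """Count variant-gene tests whose p-value is below `threshold`."""
    import hail as hl  # lazy import: requires a working Hail installation

    mt = hl.read_matrix_table(mt_path)
    # `p_value` is an assumed entry field name, not something from the thread.
    return mt.aggregate_entries(hl.agg.count_where(mt.p_value < threshold))


def significant_hits(mt_path, threshold=1e-6):
    """Return the surviving entries as a Table, one row per variant-gene pair."""
    import hail as hl

    mt = hl.read_matrix_table(mt_path)
    # Filtered-out entries are dropped when the matrix is flattened with entries().
    mt = mt.filter_entries(mt.p_value < threshold)
    return mt.entries()
```

Both functions still scan every entry, so their runtime is what you would want to measure on a subset.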

You could try storing it as a 100 billion row table. That will probably work, but I don’t think Hail will have any special advantage on such a table compared to BigQuery or another analytics-oriented database. I’d also warn you that this representation has more overhead (space-wise) than the MatrixTable representation.
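
One way to see where the extra space goes is to count how many key values each layout has to store. This is only a rough sketch: it counts logical key values using the sizes from the question and ignores encoding and compression, which will change the real on-disk numbers.

```python
N_VARIANTS = 5_000_000   # ~5M variants, from the question
N_GENES = 20_000         # 20k genes, from the question

n_tests = N_VARIANTS * N_GENES

# Long table: every row repeats both its variant key and its gene key.
table_key_values = 2 * n_tests

# Matrix: each variant key is stored once per row, each gene key once per column.
matrix_key_values = N_VARIANTS + N_GENES

print(f"tests:             {n_tests:,}")
print(f"table key values:  {table_key_values:,}")
print(f"matrix key values: {matrix_key_values:,}")
```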

Regardless of the representation (table vs. matrix), the only way I know to enable quick filtering by rows, by columns, and by rows and columns together is to store the dataset twice: once normally and once transposed. That’s effectively what a SQL database index will do.
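
As a toy illustration of the "store it twice" idea (plain Python dictionaries, not Hail, and with made-up variant and gene names), the same results can be keyed once by variant and once by gene so that either lookup is direct:

```python
from collections import defaultdict


def build_indexes(results):
    """Index (variant, gene, p_value) triples in both orientations."""
    by_variant = defaultdict(dict)  # variant -> {gene: p_value}
    by_gene = defaultdict(dict)     # gene -> {variant: p_value}, the transposed copy
    for variant, gene, p_value in results:
        by_variant[variant][gene] = p_value
        by_gene[gene][variant] = p_value
    return by_variant, by_gene


# Hypothetical example data.
results = [
    ("chr1:1000:A:G", "GENE_A", 3e-8),
    ("chr1:1000:A:G", "GENE_B", 0.4),
    ("chr2:500:C:T", "GENE_A", 2e-7),
]
by_variant, by_gene = build_indexes(results)

print(by_variant["chr1:1000:A:G"])  # all genes tested against one variant
print(by_gene["GENE_A"])            # all variants tested against one gene
```

The cost is the same as in the full-size case: twice the storage in exchange for fast access along both axes.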

Hail queries should scale linearly in rows and columns, so I recommend testing on a smaller dataset and extrapolating.
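
A minimal sketch of that extrapolation, assuming runtime is strictly proportional to rows x columns. The timing below is a placeholder, not a measurement, and the helper ignores fixed costs such as cluster startup, so treat the output as a rough estimate.

```python
def extrapolate_runtime(sample_seconds, sample_rows, sample_cols, full_rows, full_cols):
    """Scale a measured runtime linearly by the ratio of matrix sizes."""
    scale = (full_rows * full_cols) / (sample_rows * sample_cols)
    return sample_seconds * scale


# Placeholder: suppose a 50,000-variant subset with all 20,000 genes took 10 s.
estimate = extrapolate_runtime(
    sample_seconds=10.0,
    sample_rows=50_000,
    sample_cols=20_000,
    full_rows=5_000_000,
    full_cols=20_000,
)
print(f"estimated full runtime: {estimate:.0f} s")
```

Timing two or three subset sizes and checking that the estimates agree is a cheap way to confirm the linear assumption holds on your cluster.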