Hail suitability for trans-eQTL

jeales · April 30, 2023, 5:48pm

Hi Everyone,
We’re looking at solutions for storing and querying the results of a trans-eQTL data set.
That's 20k genes x ~5M variants = 1e11 tests (100 billion rows x 8 columns).
We know that a Hail MatrixTable or Table would work here.
But I’m interested in what kind of execution time we could expect for a query applied to all rows, for something like p-value < 1e-6?
Has anyone tried this and do you have any runtime results you could share?
thanks

danking · May 1, 2023, 2:25pm

If I understand correctly, you’re asking about analyzing the results of this dataset. You have already produced it.

What kind of queries do you want to execute? My instinct is to store this as a variant-by-gene matrix. That’s 5M rows by 20,000 columns. By Hail standards, that’s a fairly small matrix. You’ll be able to filter to a set of rows quickly but you cannot quickly filter columns.
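As a rough sketch of the matrix layout, the p-value query from the question could be written as below. It assumes the results were written as a MatrixTable keyed by `locus` on rows, with an entry field named `p_value`. Those field names, the path argument, and the interval string are all hypothetical, so adjust them to the real schema.

```python
def count_significant(mt_path: str, threshold: float = 1e-6) -> int:
    """Count (variant, gene) tests whose p-value is below `threshold`."""
    # Imported inside the function so this module loads without a running JVM.
    import hail as hl

    mt = hl.read_matrix_table(mt_path)
    # `p_value` is an assumed entry field name, not one taken from the thread.
    return mt.aggregate_entries(hl.agg.count_where(mt.p_value < threshold))


def count_significant_in_region(mt_path: str, interval: str,
                                threshold: float = 1e-6) -> int:
    """Same count, restricted to variants inside one locus interval."""
    import hail as hl

    mt = hl.read_matrix_table(mt_path)
    # Assumes rows are keyed by `locus`, which is what makes row filtering fast.
    region = hl.filter_intervals(mt, [hl.parse_locus_interval(interval)])
    return region.aggregate_entries(
        hl.agg.count_where(region.p_value < threshold)
    )
```

The first function scans every entry. The second only reads the partitions that overlap the interval, which is the fast row filter described above.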

You could try storing it as a 100 billion row table. That will probably work, but I don’t think Hail has any special advantage on such a table compared to BigQuery or another analytics-oriented database. I’d also warn you that this representation has more overhead (space-wise) than the Matrix Table representation.

Regardless of the representation (table vs matrix), the only way I know to enable quick filtering of rows, of columns, and of rows and columns together is to store the dataset twice: once normally and once transposed. That’s effectively what a SQL database index will do.
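To illustrate the "store it twice" idea, here is a toy sketch in plain Python (not Hail). The variant and gene identifiers and the p-values are invented. The same results are kept in two lookups, one keyed by variant and one keyed by gene, so either kind of filter avoids a full scan.

```python
from collections import defaultdict

# Toy results as (variant, gene, p_value); all values are made up.
results = [
    ("v1", "geneA", 2e-8),
    ("v1", "geneB", 0.3),
    ("v2", "geneA", 5e-7),
    ("v3", "geneB", 1e-9),
]

# Two copies of the same data, each organised for one access pattern.
by_variant = defaultdict(list)  # the "normal" orientation
by_gene = defaultdict(list)     # the "transposed" orientation
for variant, gene, p in results:
    by_variant[variant].append((gene, p))
    by_gene[gene].append((variant, p))


def hits_for_gene(gene, threshold=1e-6):
    """Variants significantly associated with one gene."""
    return sorted(v for v, p in by_gene.get(gene, []) if p < threshold)


def hits_for_variant(variant, threshold=1e-6):
    """Genes significantly associated with one variant."""
    return sorted(g for g, p in by_variant.get(variant, []) if p < threshold)
```

The cost is the one mentioned above: storage doubles, and both copies have to be kept in sync if the results are ever regenerated.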

Hail queries should scale linearly in rows and columns, so I recommend testing on a smaller dataset and extrapolating.
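A minimal sketch of that extrapolation, assuming runtime really is proportional to rows times columns. The function name and the pilot numbers in the usage note are hypothetical, and the estimate ignores fixed costs such as cluster start-up.

```python
def extrapolate_runtime(pilot_seconds: float,
                        pilot_rows: int, pilot_cols: int,
                        full_rows: int, full_cols: int) -> float:
    """Scale a pilot timing to the full dataset, assuming linear scaling
    in both the number of rows and the number of columns."""
    scale = (full_rows / pilot_rows) * (full_cols / pilot_cols)
    return pilot_seconds * scale
```

For example, if a pilot on 50,000 variants by 20,000 genes took 12 seconds (a made-up figure), `extrapolate_runtime(12.0, 50_000, 20_000, 5_000_000, 20_000)` gives 1200 seconds for the full 5M variants on the same cluster. Timing two or three pilot sizes is a cheap way to check that the scaling is actually linear before trusting the estimate.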
