Hello,
I’m trying to understand how to help the query planner perform better on large datasets, as well as how to scale clusters effectively. I have a MatrixTable with dimensions (73494807, 250115), i.e. roughly 73.5 million variant rows across about 250k whole genomes. I’m fairly certain I’ve annotated everything I need onto it, but I’m struggling with query performance.
Is it best practice to apply the most selective filters first? Or can the query optimizer handle all the filters at once? For example, to find samples that carry rare variants (1000 Genomes Phase 3 AF <= 0.0001) with a REVEL score greater than 0.75 in a specific gene, A2ML1:
import hail as hl

# mt holds the MatrixTable loaded earlier; db is Hail's annotation database
# (e.g. db = hl.experimental.DB(region='us', cloud='gcp')), and then:
annotated = db.annotate_rows_db(mt, "dbNSFP_variants")

# Keep rare, high-REVEL variants annotated to A2ML1
vars_of_interest = annotated.filter_rows(
    (annotated.dbNSFP_variants[0]['REVEL_score'] > 0.75)
    & (annotated.dbNSFP_variants[0]['genename'] == "A2ML1")
    & (annotated.dbNSFP_variants[0]['1000Gp3_AF'] <= 0.0001)
)
# Keep samples with at least one non-ref genotype among those variants
target_subs = vars_of_interest.filter_cols(hl.agg.any(vars_of_interest.GT.is_non_ref()))
Would I be better off filtering by genename first and only then applying the other filters in later steps (sketched below)? Also, what’s an effective way to scale clusters: should I prefer fewer, bigger nodes or more, smaller nodes?
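For concreteness, this is the reordering I have in mind, reusing the same annotated table and fields as above, just split into separate steps (I don’t know whether Hail’s optimizer would effectively fuse these anyway):

# Step 1: restrict to the gene of interest first
by_gene = annotated.filter_rows(
    annotated.dbNSFP_variants[0]['genename'] == "A2ML1"
)
# Step 2: apply the score and frequency filters on the smaller table
vars_of_interest = by_gene.filter_rows(
    (by_gene.dbNSFP_variants[0]['REVEL_score'] > 0.75)
    & (by_gene.dbNSFP_variants[0]['1000Gp3_AF'] <= 0.0001)
)
# Step 3: keep samples carrying at least one of the remaining variants
target_subs = vars_of_interest.filter_cols(hl.agg.any(vars_of_interest.GT.is_non_ref()))

Is there any practical difference between this and the single combined filter_rows call, or does the order only matter if one of the steps lets Hail skip reading data?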