Hello,
I’m trying to understand how to help the query planner perform better on large datasets, as well as how to scale clusters effectively. I have a MatrixTable with dimensions (73494807, 250115), i.e. roughly 73.5 million variant rows across about 250k whole genomes. I’m fairly certain I’ve annotated everything I need onto it, but I’m struggling with query performance.
Is it best practice to apply the most selective filters first? Or can the query optimizer handle all the filters at once? For example, to find samples that carry rare variants (1000 Genomes Phase 3 AF <= 0.0001) with a REVEL score greater than 0.75 in a specific gene, A2ML1:
import hail as hl

# mt holds the MatrixTable loaded earlier; db is Hail's annotation database
# (e.g. db = hl.experimental.DB(region='us', cloud='gcp')), and then:
annotated = db.annotate_rows_db(mt, "dbNSFP_variants")

# Keep rare, high-REVEL variants annotated to A2ML1
vars_of_interest = annotated.filter_rows(
    (annotated.dbNSFP_variants[0]['REVEL_score'] > 0.75)
    & (annotated.dbNSFP_variants[0]['genename'] == "A2ML1")
    & (annotated.dbNSFP_variants[0]['1000Gp3_AF'] <= 0.0001)
)
# Keep samples with at least one non-ref genotype among those variants
target_subs = vars_of_interest.filter_cols(hl.agg.any(vars_of_interest.GT.is_non_ref()))
Would I be better off filtering by genename first and only then applying the other filters in later steps (sketched below)? Also, what’s an effective way to scale clusters: should I prefer fewer, bigger nodes or more, smaller nodes?
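For concreteness, this is the reordering I have in mind, reusing the same annotated table and fields as above, just split into separate steps (I don’t know whether Hail’s optimizer would effectively fuse these anyway):

# Step 1: restrict to the gene of interest first
by_gene = annotated.filter_rows(
    annotated.dbNSFP_variants[0]['genename'] == "A2ML1"
)
# Step 2: apply the score and frequency filters on the smaller table
vars_of_interest = by_gene.filter_rows(
    (by_gene.dbNSFP_variants[0]['REVEL_score'] > 0.75)
    & (by_gene.dbNSFP_variants[0]['1000Gp3_AF'] <= 0.0001)
)
# Step 3: keep samples carrying at least one of the remaining variants
target_subs = vars_of_interest.filter_cols(hl.agg.any(vars_of_interest.GT.is_non_ref()))

Is there any practical difference between this and the single combined filter_rows call, or does the order only matter if one of the steps lets Hail skip reading data?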