I am trying to combine two Hail matrix tables, but performance falls off massively after the join. As a simple example, mt.count() runs almost instantaneously on the two base tables but has been running for hours on the joined table (which is actually smaller than one of the original base tables). I am working on a Spark cluster via Terra.Bio.

Can anyone explain why this is happening and whether there is anything I can do to improve performance?

The original matrices I am reading in are 12,573 x 33,931,725 and 33,004 x 33,931,725.

I am subsetting the second matrix to 6,608 x 33,931,725.

Applying count() to any of these matrices is almost instantaneous.

I then join the subsetted tables using the code below:

```
import hail as hl

# Subset each matrix table to the samples of interest
mt_con_sub = control_mt.filter_cols(hl.literal(con_sample_list).contains(control_mt.s))
mt_case_sub = case_mt.filter_cols(hl.literal(case_sample_list).contains(case_mt.s))
# Join the two subsets column-wise
mt = mt_con_sub.union_cols(mt_case_sub)
# Check the size of the new table
mt.count()
```

mt.count() on the joined table has now been running for hours (it has not yet returned), even though the join itself appears to have succeeded.

mt.n_partitions() reports 5,000 partitions.

I appreciate that these are big datasets, but what I find most odd is that count() slows down by so many orders of magnitude after the join.

I am currently writing the new combined matrix to file, and I will then read it back in via a different notebook to see whether that improves performance, but any guidance on how I can improve efficiency would be very welcome - it's getting expensive!

Thanks for your time and any help you can provide,

Angus