mt.count() orders of magnitude slower after union_cols() - why?

I am trying to combine two Hail matrix tables, but performance falls off massively after the join. To give a simple example, mt.count() completes almost instantaneously on the two base tables but has been running for hours on the joined table (which is actually smaller than one of the original base tables). I am working on a Spark cluster via Terra.Bio.

Can anyone explain why this happens and whether there is anything I can do to improve performance?

The original matrices I am reading in are 12,573 x 33,931,725 and 33,004 x 33,931,725.
I am subsetting the second matrix to 6,608 x 33,931,725.
Applying count() to any of these matrices is almost instantaneous.

I then join the subsetted tables using the code below:

import hail as hl

# Subset the columns (samples) of each table to the samples of interest
mt_con_sub = control_mt.filter_cols(hl.literal(con_sample_list).contains(control_mt.s))
mt_case_sub = case_mt.filter_cols(hl.literal(case_sample_list).contains(case_mt.s))
# Join the two subsets column-wise (inner join on row keys by default)
mt = mt_con_sub.union_cols(mt_case_sub)
# Check the size of the new table
mt.count()

mt.count() now runs for hours (I don’t have an output yet), despite the join appearing to succeed.

mt.n_partitions() shows there are 5,000 partitions.

I appreciate these are big datasets, but what I find most odd is that count() slows down by so many orders of magnitude.

I am trying to write the new combined matrix table to file, and I will then read it back in via a different notebook to see whether that improves performance, but any guidance on how I can improve efficiency would be very welcome - it’s getting expensive!

Thanks for your time and any help you can provide,
Angus

count() isn’t doing any real work before the join: it’s just returning a number stored in the file metadata. Counting the variants after union_cols requires actually executing the inner join on the row keys.
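
To illustrate, here is a minimal sketch (with hypothetical paths) of why the two count() calls behave so differently:

import hail as hl

mt_a = hl.read_matrix_table('gs://bucket/control.mt')  # hypothetical path
mt_a.count()  # fast: row/column counts are stored in the table's metadata

mt_b = hl.read_matrix_table('gs://bucket/case.mt')  # hypothetical path
joined = mt_a.union_cols(mt_b)
joined.count()  # slow: must execute the inner join on row keys to count rows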

What size cluster are you using? This is a lot of data to process; more nodes will help.

Thanks! I had assumed count() was doing the same thing both times.

I increased the cluster size to 20 workers and wrote a new matrix table over a few hours. Reading it back in and executing count() was again almost instantaneous (presumably because count() now comes from the new file’s metadata, as you mentioned).
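
For anyone following the same route, here is a sketch of that write-then-read pattern (paths hypothetical):

# Write the joined table, materializing the join once
mt.write('gs://workspace-bucket/combined.mt', overwrite=True)
# Read it back; count() is now metadata-backed again
mt = hl.read_matrix_table('gs://workspace-bucket/combined.mt')
mt.count()

Hail’s checkpoint() does the same write and read in a single call: mt = mt.checkpoint('gs://workspace-bucket/combined.mt', overwrite=True).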

Thank you!
Angus