mt.count() orders of magnitude slower after union_cols() - why?

I am trying to combine two Hail matrix tables, but performance falls off massively after the join. To give a simple example, mt.count() completes almost instantaneously on the two base tables but has been running for hours on the joined table (which is actually smaller than one of the original base tables). I am working on a Spark cluster via Terra.Bio.

Can anyone explain why this happens and whether there is anything I can do to improve performance?

The original matrices I am reading in are 12,573 x 33,931,725 and 33,004 x 33,931,725.
I am subsetting the second matrix to 6,608 x 33,931,725.
Applying count() to any of these matrices is almost instantaneous.

I then join the subsetted tables using the code below:

import hail as hl

# Subset the columns (samples) of each table to the samples of interest
mt_con_sub = control_mt.filter_cols(hl.literal(con_sample_list).contains(control_mt.s))
mt_case_sub = case_mt.filter_cols(hl.literal(case_sample_list).contains(case_mt.s))
# Join the two subsets column-wise (inner join on row keys by default)
mt = mt_con_sub.union_cols(mt_case_sub)
# Check the size of the new table
mt.count()

mt.count() now runs for hours (I don’t have an output yet), despite the join appearing to succeed.

mt.n_partitions() shows there are 5,000 partitions.

I appreciate these are big datasets, but what I find most odd is that count() slows down by so many orders of magnitude.

I am trying to write the new combined matrix table to file, and I will then read it back in via a different notebook to see whether that improves performance, but any guidance on how I can improve efficiency would be very welcome - it’s getting expensive!

Thanks for your time and any help you can provide,
Angus

count() isn’t doing any real work before the join: it’s just returning a number stored in the file metadata. Counting the variants after union_cols requires actually executing the inner join on the row keys.
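
To illustrate, here is a minimal sketch (with hypothetical paths) of why the two count() calls behave so differently:

import hail as hl

mt_a = hl.read_matrix_table('gs://bucket/control.mt')  # hypothetical path
mt_a.count()  # fast: row/column counts are stored in the table's metadata

mt_b = hl.read_matrix_table('gs://bucket/case.mt')  # hypothetical path
joined = mt_a.union_cols(mt_b)
joined.count()  # slow: must execute the inner join on row keys to count rows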

What size cluster are you using? This is a lot of data to process; more nodes will help.

Thanks! I had assumed count() was doing the same thing both times.

I increased the cluster size to 20 workers and wrote a new matrix table over a few hours. Reading it back in and executing count() was again almost instantaneous (presumably because count() now comes from the new file’s metadata, as you mentioned).
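
For anyone following the same route, here is a sketch of that write-then-read pattern (paths hypothetical):

# Write the joined table, materializing the join once
mt.write('gs://workspace-bucket/combined.mt', overwrite=True)
# Read it back; count() is now metadata-backed again
mt = hl.read_matrix_table('gs://workspace-bucket/combined.mt')
mt.count()

Hail’s checkpoint() does the same write and read in a single call: mt = mt.checkpoint('gs://workspace-bucket/combined.mt', overwrite=True).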

Thank you!
Angus