Hello,
I have a simple operation I am trying to run many times in Hail and I would like to see if I could speed it up. Briefly, I want to subset an LD matrix using a list of indices, then convert to a numpy array.
My current workflow uses hl.experimental.load_dataset to load one of the available BlockMatrix LD matrices. I then filter this matrix using a small list of indices (5-7k variants each) which I obtain separately. Finally, I convert the filtered matrix to a NumPy array using to_numpy for downstream analysis that I can't do inside Hail. The issue I'm having is that I need to repeat this subsetting operation many thousands of times (essentially a small moving window along the genome), so speeding it up would significantly improve the pipeline's overall runtime. Since the number of variants is small, my understanding is that they are often all contained within the same block, so I can't exploit any parallelism in to_numpy. Indeed, the terminal shows it typically using just a single core during this operation.
My function is below:
def subset_block_matrix(idx, mat):
    '''
    Subsets a Hail BlockMatrix to a specified set of indices.

    @param idx: array-like of indices to keep (used for both rows and columns)
    @param mat: Hail BlockMatrix to subset
    @return: A numpy array of the subsetted matrix
    '''
    # BlockMatrix.filter expects lists of Python ints
    idx = [int(i) for i in idx]
    return mat.filter(idx, idx).to_numpy()
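One idea I have been considering, in case it helps frame answers: since consecutive windows overlap heavily, the Hail round trip could perhaps be amortized by pulling one larger contiguous region into NumPy with a single filter/to_numpy call, then slicing each window locally with fancy indexing. This is only a sketch of that idea; extract_region and window_subsets are hypothetical helper names, and it assumes every window's indices fall inside [start, stop).

```python
import numpy as np

def extract_region(mat, start, stop):
    # One Hail round trip for a whole contiguous region of the LD matrix.
    # `mat` is a Hail BlockMatrix, as in subset_block_matrix above.
    idx = list(range(start, stop))
    return mat.filter(idx, idx).to_numpy()

def window_subsets(region, start, windows):
    # Slice each window out of the in-memory region with NumPy fancy
    # indexing. `windows` is a list of global index lists, all assumed
    # to lie within [start, start + region.shape[0]).
    out = []
    for idx in windows:
        local = np.asarray(idx, dtype=int) - start  # shift to region coordinates
        out.append(region[np.ix_(local, local)])
    return out
```

So the cost of one large to_numpy would be shared across all the windows it covers, rather than paying it per window. I am not sure whether this is the idiomatic Hail approach, though.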
I’m a Hail novice so there may be an entirely better way of carrying out this operation. Any advice is appreciated!