Hello,
I have a simple operation I am trying to run many times in Hail and I would like to see if I could speed it up. Briefly, I want to subset an LD matrix using a list of indices, then convert to a numpy array.
My current workflow uses hl.experimental.load_dataset to load one of the available BlockMatrix LD matrices. I then filter this matrix using a small list of indices (5-7k variants each) which I obtain separately. Finally, I convert the filtered matrix to a NumPy array using to_numpy for downstream analysis that I can't do inside Hail. The issue I'm having is that I need to repeat this subsetting operation many thousands of times (essentially a small moving window along the genome), so speeding it up would significantly improve the pipeline's overall runtime. Since the number of variants is small, my understanding is that they are often all contained within the same block, so I can't exploit any parallelism in to_numpy. Indeed, the terminal shows it typically using just a single core during this operation.
My function is below:
def subset_block_matrix(idx, mat):
    '''
    Subsets a Hail BlockMatrix to a specified set of indices.

    @param idx: array-like of indices to keep (used for both rows and columns)
    @param mat: Hail BlockMatrix to subset
    @return: A numpy array of the subsetted matrix
    '''
    # BlockMatrix.filter expects lists of Python ints
    idx = [int(i) for i in idx]
    return mat.filter(idx, idx).to_numpy()
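One idea I have been considering, in case it helps frame answers: since consecutive windows overlap heavily, the Hail round trip could perhaps be amortized by pulling one larger contiguous region into NumPy with a single filter/to_numpy call, then slicing each window locally with fancy indexing. This is only a sketch of that idea; extract_region and window_subsets are hypothetical helper names, and it assumes every window's indices fall inside [start, stop).

```python
import numpy as np

def extract_region(mat, start, stop):
    # One Hail round trip for a whole contiguous region of the LD matrix.
    # `mat` is a Hail BlockMatrix, as in subset_block_matrix above.
    idx = list(range(start, stop))
    return mat.filter(idx, idx).to_numpy()

def window_subsets(region, start, windows):
    # Slice each window out of the in-memory region with NumPy fancy
    # indexing. `windows` is a list of global index lists, all assumed
    # to lie within [start, start + region.shape[0]).
    out = []
    for idx in windows:
        local = np.asarray(idx, dtype=int) - start  # shift to region coordinates
        out.append(region[np.ix_(local, local)])
    return out
```

So the cost of one large to_numpy would be shared across all the windows it covers, rather than paying it per window. I am not sure whether this is the idiomatic Hail approach, though.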
I’m a Hail novice so there may be an entirely better way of carrying out this operation. Any advice is appreciated!