Adding sample labels to a relationship matrix

#1

I have created a Hail matrix table from a VCF with 591 samples, and I have had good success with hl.realized_relationship_matrix. It would be helpful, however, to have row and column labels on the resulting matrix. The solution I came up with was to convert the block matrix to a NumPy ndarray and then wrap the ndarray in a pandas DataFrame, using the list of samples as both row and column names:

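import hail as hl
import pandas as pd

rrm = hl.realized_relationship_matrix(mt.GT)  # BlockMatrix, samples x samples
rrm_npy = rrm.to_numpy()                      # local NumPy ndarray
samples = mt.s.collect()                      # sample IDs as a Python list
rrm_panda = pd.DataFrame(rrm_npy, index=samples, columns=samples)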

My question: does this seem like a robust solution? What is unclear to me is whether the block matrix indices are guaranteed to match the order of the list returned by mt.s.collect().

My kudos to the hail team – it is awesome.


#2

This topic has come up before, I think. The answer depends on what you want to do downstream. Your code looks fine, but it won’t scale, because to_numpy() brings the whole matrix into local memory, and converting between Hail objects and Python objects can be very slow.
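Since the labelling is purely positional, a couple of cheap checks can guard it before the labels are trusted. The following is only a toy sketch in plain Python, not Hail or pandas code: `label_square_matrix` is a hypothetical helper, and the small matrix and sample IDs are made-up stand-ins for the real matrix and for mt.s.collect(). It assumes that labels[i] names both row i and column i.

```python
def label_square_matrix(matrix, labels):
    """Attach labels to a square matrix purely by position.

    Hypothetical helper: labels[i] is assumed to name both row i and column i.
    """
    n = len(labels)
    # Duplicate IDs would make label-based lookups ambiguous.
    if len(set(labels)) != n:
        raise ValueError("labels must be unique")
    # The matrix must be n x n for positional labelling to make sense.
    if len(matrix) != n or any(len(row) != n for row in matrix):
        raise ValueError(f"expected a {n} x {n} matrix")
    return {
        row_label: dict(zip(labels, row))
        for row_label, row in zip(labels, matrix)
    }


# Made-up stand-ins for the real matrix and the collected sample IDs.
toy_matrix = [
    [1.0, 0.25, 0.0],
    [0.25, 1.0, 0.5],
    [0.0, 0.5, 1.0],
]
toy_samples = ["S1", "S2", "S3"]

labelled = label_square_matrix(toy_matrix, toy_samples)
print(labelled["S2"]["S3"])  # 0.5
```

The same two checks (unique IDs, matching shape) apply unchanged to the pandas version: compare len(samples) with the shape of the ndarray before building the DataFrame.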

One natural option is to convert it back to a MatrixTable with BlockMatrix.to_matrix_table_row_major.

Once it is a matrix table again, you can put the sample IDs back in as keys:

rrm = hl.realized_relationship_matrix(mt.GT)
rrm_mt = rrm.to_matrix_table_row_major()

sample_ids = hl.literal(mt.s.collect())
rrm_mt = rrm_mt.key_rows_by(s1 = sample_ids[rrm_mt.row_idx])
rrm_mt = rrm_mt.key_cols_by(s2 = sample_ids[rrm_mt.col_idx])

#3

Cool.

To get it to work for me, I needed to cast the MatrixTable indices to int32:

rrm_mt = rrm_mt.key_rows_by(s1 = sample_ids[hl.int32(rrm_mt.row_idx)])
rrm_mt = rrm_mt.key_cols_by(s2 = sample_ids[hl.int32(rrm_mt.col_idx)])

#4

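Ah, yes! That always comes up: row_idx and col_idx are int64, while array indexing expects int32. Requiring the explicit cast is a bit annoying, but it is better than Hail automatically doing either an unsafe cast or an expensive range check.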
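To illustrate the trade-off in general terms, here is a small plain-Python sketch of the difference between an unchecked and a checked 64-to-32-bit narrowing. It says nothing about how Hail implements hl.int32; `unchecked_int32` and `checked_int32` are hypothetical names used only for this example.

```python
import ctypes

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def unchecked_int32(value):
    """Narrow to 32 bits with no range check; out-of-range values wrap silently."""
    return ctypes.c_int32(value).value


def checked_int32(value):
    """Narrow to 32 bits, paying for a range check on every value."""
    if not INT32_MIN <= value <= INT32_MAX:
        raise OverflowError(f"{value} does not fit in int32")
    return value


print(unchecked_int32(590))    # 590
print(unchecked_int32(2**31))  # -2147483648
print(checked_int32(590))      # 590
```

With 591 samples the indices run from 0 to 590, far inside the int32 range, so the explicit cast in the previous post is safe for this matrix.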
