Randomly shuffle rows

Hi,

I am using a MatrixTable to create inputs and outputs for my AI model, so I want to randomly shuffle all of its rows. Is there a way to do that? If not, is there a way to convert it to a Spark object and shuffle it there instead? I tried the following:

tot_num = mt_3.count()[0]
sample_size = int(0.3 * tot_num)
mt_3 = mt_3.sample_rows(p=sample_size / tot_num, seed=42)

But when I check the output with mt_3.show(), the rows are not randomly arranged: I only see rows from chromosome 1 at the top.

sample_rows randomly discards rows; it does not reorder them. Hail MatrixTables are always ordered by their row key, which is usually the locus and alleles. If you want to change the ordering, you can re-key the rows by a random value:

mt = mt.key_rows_by(rand=hl.rand_int64())

Be warned that re-keying requires a lot of data movement, because every row has to be re-sorted by the new key. You probably don’t want to do that on a MatrixTable containing whole exomes or genomes.
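To make the difference between sampling and re-keying concrete, here is a plain-Python analogy. It is not Hail code: the toy `rows` list and the helper names `sample_rows` and `shuffle_by_random_key` are made up for illustration, and the list simply stands in for a table that is kept sorted by its key.

```python
import random

# Toy stand-in for a table kept sorted by its key (contig, position).
rows = [("1", pos) for pos in range(1, 6)] + [("2", pos) for pos in range(1, 6)]


def sample_rows(rows, p, seed):
    """Keep each row with probability p; surviving rows stay in key order."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < p]


def shuffle_by_random_key(rows, seed):
    """Attach a random key to every row, then sort by that key."""
    rng = random.Random(seed)
    keyed = [(rng.getrandbits(64), row) for row in rows]
    keyed.sort(key=lambda pair: pair[0])
    return [row for _, row in keyed]


sampled = sample_rows(rows, p=0.3, seed=42)
shuffled = shuffle_by_random_key(rows, seed=42)

print(sampled == sorted(sampled))        # True: sampling never reorders
print(sorted(shuffled) == sorted(rows))  # True: shuffling keeps every row
```

Sampling only filters, so whatever survives is still in key order, which is why the first rows shown are still chromosome 1. Sorting by a random key touches every row, which is where the data movement comes from.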

You can also convert to Spark by way of a Table: mt.to_table_row_major().to_spark().
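A rough sketch of the Spark route, assuming Hail and PySpark are installed and `mt` is an existing MatrixTable. The helper name `to_shuffled_spark` is hypothetical. This sketch uses `localize_entries` to get a Table with one row per MatrixTable row, then `Table.to_spark()` and PySpark's `rand` function; treat it as one possible way to do it rather than the recommended one.

```python
def to_shuffled_spark(mt, seed=42):
    """Return the rows of a Hail MatrixTable as a randomly ordered Spark DataFrame.

    Hypothetical helper for illustration; assumes Hail and PySpark are installed.
    """
    # Deferred import so the helper can be defined without PySpark on the path.
    from pyspark.sql import functions as F

    # One Table row per MatrixTable row; entries become an array field and
    # the column metadata moves to a global field named "cols".
    ht = mt.localize_entries("entries", "cols")

    # Convert to a Spark DataFrame, then sort by a random column to shuffle.
    df = ht.to_spark()
    return df.orderBy(F.rand(seed))
```

Usage would look like df = to_shuffled_spark(mt_3). Sorting by a random column is still a full sort of the data, so the same warning about data movement applies here.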
