Ordering unsorted dataset with network shuffle

Hello,

While working with a MatrixTable, I keep getting this message:

2020-09-11 13:05:25 Hail: INFO: Ordering unsorted dataset with network shuffle
2020-09-11 13:05:25 Hail: INFO: Ordering unsorted dataset with network shuffle
2020-09-11 13:05:37 Hail: INFO: Ordering unsorted dataset with network shuffle

Does this mean that the MatrixTable is being sorted after every operation? I do not need the MatrixTable to be ordered, so is it possible to disable this in order to optimise my code?

Thanks,

Hail executes joins as ordered merges, and a network shuffle is how it orders a dataset that is not already sorted by its key. It’s impossible to avoid every shuffle. However, Hail evaluates pipelines lazily, so without a saved intermediate result the same shuffle can be re-run each time a downstream result is computed. If you’re seeing this message a lot, it’s probably a good idea to insert a write/read after a key_by, an import, or another expensive operation:

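path = '/some/file'
mt.write(path)  # runs the pipeline once, shuffle included, and saves the result
mt = hl.read_matrix_table(path)  # later steps start from the saved data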

The checkpoint method is shorthand for the above:

mt = mt.checkpoint('/some/file')
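
To see why the write/read helps, here is a toy model in plain Python. It is not Hail code and not how Hail is implemented; LazyDataset, key_by_sorted, and the counter are hypothetical names made up for this sketch. It only assumes that a lazy pipeline re-runs its steps on every action unless an intermediate result has been saved.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not Hail's implementation)."""

    def __init__(self, compute):
        # Zero-argument function that produces the rows when called.
        self._compute = compute

    def key_by_sorted(self, counter):
        # Model an expensive reordering: every evaluation bumps the counter.
        def run():
            counter["shuffles"] += 1
            return sorted(self._compute())
        return LazyDataset(run)

    def collect(self):
        # An "action": runs the whole pipeline from the start.
        return self._compute()

    def checkpoint(self):
        # Evaluate once, keep the result, and return a dataset that reads it back.
        rows = self._compute()
        return LazyDataset(lambda: rows)


counter = {"shuffles": 0}
source = LazyDataset(lambda: [3, 1, 2])

# Without a checkpoint, each action repeats the reordering.
ordered = source.key_by_sorted(counter)
ordered.collect()
ordered.collect()
shuffles_without_checkpoint = counter["shuffles"]

# With a checkpoint, the reordering runs once and later actions reuse the result.
counter["shuffles"] = 0
saved = source.key_by_sorted(counter).checkpoint()
saved.collect()
saved.collect()
shuffles_with_checkpoint = counter["shuffles"]

print(shuffles_without_checkpoint, shuffles_with_checkpoint)
```

In this toy the reordering runs on every action without the checkpoint and only once with it, which is the same reason a write/read after an expensive step can cut down on repeated shuffle messages.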

OK, I will try that.