Shuffle behavior with/without hl.eval

lw453 · May 1, 2023, 4:12pm

Hello Hail team, I ran into some behavior I found unexpected when using hl.shuffle and was hoping you could shed some light on what I am seeing. Essentially, when I try to shuffle an array and take a subset of the permutation to filter a table, I get an unexpected result if I do not use hl.eval on the permutation subset before filtering. Here is a simple example to demonstrate:

> ht = hl.utils.range_table(100)
> idx_permut = hl.shuffle(hl.range(100))
> ht.filter(hl.set(idx_permut[:10]).contains(ht.idx)).show()
+-------+
|   idx |
+-------+
| int32 |
+-------+
|    52 |
|    66 |
|    67 |
|    70 |
|    75 |
|    77 |
|    83 |
+-------+
> ht.filter(hl.set(hl.eval(idx_permut[:10])).contains(ht.idx)).show()
+-------+
|   idx |
+-------+
| int32 |
+-------+
|    21 |
|    27 |
|    42 |
|    45 |
|    60 |
|    67 |
|    81 |
|    93 |
|    94 |
|    95 |
+-------+

Without hl.eval, the number of filtered rows doesn’t match the purported size of the subset and it also does not contain the same elements as with hl.eval. Could you help me understand what is going on here? Does it have to do with lazy evaluation?

tpoterba · May 1, 2023, 4:16pm

“Does it have to do with lazy evaluation?”

Yes.

This:

> idx_permut = hl.shuffle(hl.range(100))
> ht.filter(hl.set(idx_permut[:10]).contains(ht.idx)).show()

is exactly the same as:

> ht.filter(hl.set(hl.shuffle(hl.range(100))[:10]).contains(ht.idx)).show()
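
To see why the inlined form can keep a number of rows other than 10, here is a plain-Python model of the two cases. It is only an analogy, not Hail's implementation: it assumes the shuffle is drawn again for each row, and the function names are made up for illustration.

```python
import random


def filter_reshuffling_per_row(n_rows, k, rng):
    """Keep row i only if i lands in the first k slots of a shuffle drawn for that row."""
    kept = []
    for i in range(n_rows):
        perm = list(range(n_rows))
        rng.shuffle(perm)  # a new permutation for every row
        if i in perm[:k]:
            kept.append(i)
    return kept


def filter_with_fixed_subset(n_rows, k, rng):
    """Draw one permutation up front and reuse its first k elements for every row."""
    perm = list(range(n_rows))
    rng.shuffle(perm)  # a single permutation, drawn once
    subset = set(perm[:k])
    return [i for i in range(n_rows) if i in subset]


rng = random.Random()
print(len(filter_with_fixed_subset(100, 10, rng)))    # always 10
print(len(filter_reshuffling_per_row(100, 10, rng)))  # not guaranteed to be 10
```

In the per-row model each row passes independently with probability 10/100, so the number of surviving rows is 10 only on average.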

If you want to use a variable in a Hail expression so that its value stays constant (rather than being re-evaluated for every row, column, or entry), you can either use hl.literal to make it a “literal expression”, or annotate it into the table globals with ht = ht.annotate_globals(idx_permut=...) and refer to it as ht.idx_permut in row computations.

tpoterba · May 1, 2023, 4:18pm

Also, please feel heartened that this is one of the most complicated areas of the Hail interface – the development team has argued about this for years and roughly concluded that there are other designs that might make your example “predictable”, but would make others more confusing.
