Could you help me understand when (and why) NA != NA in Hail? This came up in a discussion with a user, where a table join was conducted, and yielded no NA matches from the joined-on column.
This may be to follow numpy/pandas behavior, but seems pretty confusing, as NaN != NaN is IEEE standard, while None != None is a numpy/pandas change from Python’s handling. Since Python is the closest user-facing interface, this seems magical.
In [1]: import pandas as pd
In [2]: A = pd.Series(['a', None, 2])
In [3]: B = pd.Series(['a', None, 2])
In [4]: all(A == B)
Out[4]: False
In [5]: all(lambda a, b: a == b for a, b in zip(A, B))
Out[5]: True
In [5]: all(A.values == B.values)
Out[5]: True
In [6]: all(lambda a, b: a == b for a, b in zip(A.values, B.values))
Out[6]: True
Could you help me understand when (and why) NA != NA in Hail
This is a really hard point to get right. I’ll do my best to list missingness treatment in various places.
In Hail expressions, NA == NA => NA and NaN == NaN => False. Top-level missingness always bubbles up, with only a few operations able to circumvent that rule: hl.is_defined, hl.is_missing, as well as short-circuiting in | and &. Note that in the current system, comparisons of nested objects will treat nested missingness as equal:
In [7]: hl.eval(hl.null('int') == hl.null('int'))
<None>
In [8]: hl.eval(hl.float('nan') == hl.float('nan'))
Out[8]: False
You could easily argue that this is a bug:
In [11]: hl.eval(hl.array([hl.null('int')]) == hl.array([hl.null('int')]))
Out[11]: True
In joins, NA == NA => no match and NaN == NaN => no match.
In [1]: ht = hl.utils.range_table(2).key_by(x = hl.null('int'), y = hl.float('nan'))
In [3]: ht.join(ht, 'outer').show()
+-------+---------+-------+-------+
| x | y | idx | idx_1 |
+-------+---------+-------+-------+
| int32 | float64 | int32 | int32 |
+-------+---------+-------+-------+
| NA | NaN | 0 | NA |
| NA | NaN | 1 | NA |
| NA | NaN | NA | 0 |
| NA | NaN | NA | 1 |
+-------+---------+-------+-------+
In some other operations that use table keys, NA == NA => True and NaN == NaN => True:
In [19]: ht = hl.utils.range_table(2).key_by(x = hl.null('int'), y = hl.float('nan'))
In [20]: ht.show()
+-------+-------+---------+
| idx | x | y |
+-------+-------+---------+
| int32 | int32 | float64 |
+-------+-------+---------+
| 0 | NA | NaN |
| 1 | NA | NaN |
+-------+-------+---------+
In [21]: ht.distinct().show()
+-------+-------+---------+
| idx | x | y |
+-------+-------+---------+
| int32 | int32 | float64 |
+-------+-------+---------+
| 0 | NA | NaN |
+-------+-------+---------+
NA == NA and NaN != NaN both sound perfectly correct, and evaluating objects of the same shape / value as equal also seems right.
The user’s query joined on a field that was occasionally missing. In those cases, the join, which was a left join, failed to return the right operand’s row in which the joined-on key was NA in both the left and right operands.
So, x.key = NA y.key = NA, join failed to return y. Is that expected behavior when x.key and y.key are int (or in any other case)? I assumed this was due to numpy-like silent conversion to NaN.
You could easily argue that this [lists of NAs comparing equal] is a bug
I improved some comparisons (<, etc.) recently, but related to NaNs, not missingness. We now have:
>>> a = hl.array([hl.float('nan')])
>>> hl.eval(a == a)
False
I agree, although (1) this will be another behavior change, and (2) this will require making ExtendedOrdering return java.lang.Boolean instead of Boolean (and the analogous change for the C++ and CodeOrdering comparisons).
NA == NA sound[s] perfectly correct
This is not the path we’ve taken so far. We think of NA not as a value, but as a value we don’t know. So comparing two unknown values shouldn’t be True, it should also be unknown. While this diverges from the Python’s behavior with respect to None (and one might argue our NA shouldn’t be converted into Python None), it does match SQL and R, for example:
$ R
> NA == NA
[1] NA
and the join behavior you’re seeing is the standard SQL join behavior.
In some other operations that use table keys, NA == NA => True and NaN == NaN => True
How did this come about? There are two sets of comparisons: the normal <, <=, etc. that are not a total comparison (e.g. NaN compares false with everything) and a total comparison on every type (implemented via compare: Int). To reliably order tables, we need to use the total comparison. That then got used for various operations like distinct.
I’d think this is a bug, but surprisingly, SQL SELECT DISTINCT matches our current behavior (NAs are equal).