None != None (NA != NA)

Alex_Kotlar · December 14, 2018, 7:59pm

Could you help me understand when (and why) NA != NA in Hail? This came up in a discussion with a user, where a table join was conducted, and yielded no NA matches from the joined-on column.

This may be to follow numpy/pandas behavior, but seems pretty confusing, as NaN != NaN is IEEE standard, while None != None is a numpy/pandas change from Python’s handling. Since Python is the closest user-facing interface, this seems magical.

May relate to: https://github.com/pandas-dev/pandas/issues/20442

In particular:

In [1]: import pandas as pd
In [2]: A = pd.Series(['a', None, 2])
In [3]: B = pd.Series(['a', None, 2])

In [4]: all(A == B)
Out[4]: False
In [5]: all(lambda a, b: a == b for a, b in zip(A, B))
Out[5]: True

In [5]: all(A.values == B.values)
Out[5]: True
In [6]: all(lambda a, b: a == b for a, b in zip(A.values, B.values))
Out[6]: True

tpoterba · December 17, 2018, 2:07pm

Oops, sorry to have missed this!

Could you help me understand when (and why) NA != NA in Hail

This is a really hard point to get right. I’ll do my best to list missingness treatment in various places.

In Hail expressions, NA == NA => NA and NaN == NaN => False. Top-level missingness always bubbles up, with only a few operations able to circumvent that rule: hl.is_defined, hl.is_missing, as well as short-circuiting in | and &. Note that in the current system, comparisons of nested objects will treat nested missingness as equal:

In [7]: hl.eval(hl.null('int') == hl.null('int'))
<None>

In [8]: hl.eval(hl.float('nan') == hl.float('nan'))
Out[8]: False

You could easily argue that this is a bug:

In [11]: hl.eval(hl.array([hl.null('int')]) == hl.array([hl.null('int')]))
Out[11]: True

In joins, NA == NA => no match and NaN == NaN => no match.

In [1]: ht = hl.utils.range_table(2).key_by(x = hl.null('int'), y = hl.float('nan'))

In [3]: ht.join(ht, 'outer').show()
+-------+---------+-------+-------+
|     x |       y |   idx | idx_1 |
+-------+---------+-------+-------+
| int32 | float64 | int32 | int32 |
+-------+---------+-------+-------+
|    NA |     NaN |     0 |    NA |
|    NA |     NaN |     1 |    NA |
|    NA |     NaN |    NA |     0 |
|    NA |     NaN |    NA |     1 |
+-------+---------+-------+-------+

In some other operations that use table keys, NA == NA => True and NaN == NaN => True:

In [19]: ht = hl.utils.range_table(2).key_by(x = hl.null('int'), y = hl.float('nan'))

In [20]: ht.show()
+-------+-------+---------+
|   idx |     x |       y |
+-------+-------+---------+
| int32 | int32 | float64 |
+-------+-------+---------+
|     0 |    NA |     NaN |
|     1 |    NA |     NaN |
+-------+-------+---------+


In [21]: ht.distinct().show()
+-------+-------+---------+
|   idx |     x |       y |
+-------+-------+---------+
| int32 | int32 | float64 |
+-------+-------+---------+
|     0 |    NA |     NaN |
+-------+-------+---------+

Alex_Kotlar · December 17, 2018, 6:13pm

NA == NA and NaN != NaN both sound perfectly correct, and evaluating objects of the same shape / value as equal also seems right.

The user’s query joined on a field that was occasionally missing. In those cases, the join, which was a left join, failed to return the right operand’s row in which the joined-on key was NA in both the left and right operands.

So, x.key = NA y.key = NA, join failed to return y. Is that expected behavior when x.key and y.key are int (or in any other case)? I assumed this was due to numpy-like silent conversion to NaN.

tpoterba · December 17, 2018, 7:43pm

we definitely don’t convert to NaN – I think we explicitly filter out missing keys in left/right joins.

You’ve definitely turned up inconsistencies, though. We should get that fixed.

cseed · December 19, 2018, 3:38am

You could easily argue that this [lists of NAs comparing equal] is a bug

I improved some comparisons (<, etc.) recently, but related to NaNs, not missingness. We now have:

>>> a = hl.array([hl.float('nan')])
>>> hl.eval(a == a)
False

I agree, although (1) this will be another behavior change, and (2) this will require making ExtendedOrdering return java.lang.Boolean instead of Boolean (and the analogous change for the C++ and CodeOrdering comparisons).

NA == NA sound[s] perfectly correct

This is not the path we’ve taken so far. We think of NA not as a value, but as a value we don’t know. So comparing two unknown values shouldn’t be True, it should also be unknown. While this diverges from the Python’s behavior with respect to None (and one might argue our NA shouldn’t be converted into Python None), it does match SQL and R, for example:

$ R
> NA == NA
[1] NA

and the join behavior you’re seeing is the standard SQL join behavior.

cseed · December 19, 2018, 4:42pm

One more comment:

In some other operations that use table keys, NA == NA => True and NaN == NaN => True

How did this come about? There are two sets of comparisons: the normal <, <=, etc. that are not a total comparison (e.g. NaN compares false with everything) and a total comparison on every type (implemented via compare: Int). To reliably order tables, we need to use the total comparison. That then got used for various operations like distinct.

I’d think this is a bug, but surprisingly, SQL SELECT DISTINCT matches our current behavior (NAs are equal).

Topic		Replies	Views
Way to check if value is NA and replace it Hail Query & hailctl	3	1114	September 11, 2020
How does Hail treat missing calls? Hail Query & hailctl	2	268	October 5, 2023
Filtering table rows using two conditions Hail Query & hailctl	4	925	November 24, 2021
Hl.missing is absent in 0.2.57 Hail Query & hailctl	5	429	April 26, 2021
Negating logical operators Hail Query & hailctl	2	518	February 28, 2022

None != None (NA != NA)

Related topics