None != None (NA != NA)


#1

Could you help me understand when (and why) NA != NA in Hail? This came up in a discussion with a user, where a table join was conducted, and yielded no NA matches from the joined-on column.

This may be to follow numpy/pandas behavior, but seems pretty confusing, as NaN != NaN is IEEE standard, while None != None is a numpy/pandas change from Python’s handling. Since Python is the closest user-facing interface, this seems magical.

May relate to: https://github.com/pandas-dev/pandas/issues/20442

In particular:

In [1]: import pandas as pd
In [2]: A = pd.Series(['a', None, 2])
In [3]: B = pd.Series(['a', None, 2])

In [4]: all(A == B)
Out[4]: False
In [5]: all(lambda a, b: a == b for a, b in zip(A, B))
Out[5]: True

In [5]: all(A.values == B.values)
Out[5]: True
In [6]: all(lambda a, b: a == b for a, b in zip(A.values, B.values))
Out[6]: True

#2

Oops, sorry to have missed this!

Could you help me understand when (and why) NA != NA in Hail

This is a really hard point to get right. I’ll do my best to list missingness treatment in various places.

  1. In Hail expressions, NA == NA => NA and NaN == NaN => False. Top-level missingness always bubbles up, with only a few operations able to circumvent that rule: hl.is_defined, hl.is_missing, as well as short-circuiting in | and &. Note that in the current system, comparisons of nested objects will treat nested missingness as equal:
In [7]: hl.eval(hl.null('int') == hl.null('int'))
<None>
In [8]: hl.eval(hl.float('nan') == hl.float('nan'))
Out[8]: False

You could easily argue that this is a bug:

In [11]: hl.eval(hl.array([hl.null('int')]) == hl.array([hl.null('int')]))
Out[11]: True
  1. In joins, NA == NA => no match and NaN == NaN => no match.
In [1]: ht = hl.utils.range_table(2).key_by(x = hl.null('int'), y = hl.float('nan'))

In [3]: ht.join(ht, 'outer').show()
+-------+---------+-------+-------+
|     x |       y |   idx | idx_1 |
+-------+---------+-------+-------+
| int32 | float64 | int32 | int32 |
+-------+---------+-------+-------+
|    NA |     NaN |     0 |    NA |
|    NA |     NaN |     1 |    NA |
|    NA |     NaN |    NA |     0 |
|    NA |     NaN |    NA |     1 |
+-------+---------+-------+-------+
  1. In some other operations that use table keys, NA == NA => True and NaN == NaN => True:
In [19]: ht = hl.utils.range_table(2).key_by(x = hl.null('int'), y = hl.float('nan'))

In [20]: ht.show()
+-------+-------+---------+
|   idx |     x |       y |
+-------+-------+---------+
| int32 | int32 | float64 |
+-------+-------+---------+
|     0 |    NA |     NaN |
|     1 |    NA |     NaN |
+-------+-------+---------+


In [21]: ht.distinct().show()
+-------+-------+---------+
|   idx |     x |       y |
+-------+-------+---------+
| int32 | int32 | float64 |
+-------+-------+---------+
|     0 |    NA |     NaN |
+-------+-------+---------+

#3

NA == NA and NaN != NaN both sound perfectly correct, and evaluating objects of the same shape / value as equal also seems right.

The user’s query joined on a field that was occasionally missing. In those cases, the join, which was a left join, failed to return the right operand’s row in which the joined-on key was NA in both the left and right operands.

So, x.key = NA y.key = NA, join failed to return y. Is that expected behavior when x.key and y.key are int (or in any other case)? I assumed this was due to numpy-like silent conversion to NaN.


#4

we definitely don’t convert to NaN – I think we explicitly filter out missing keys in left/right joins.

You’ve definitely turned up inconsistencies, though. We should get that fixed.


#5

You could easily argue that this [lists of NAs comparing equal] is a bug

I improved some comparisons (<, etc.) recently, but related to NaNs, not missingness. We now have:

>>> a = hl.array([hl.float('nan')])
>>> hl.eval(a == a)
False

I agree, although (1) this will be another behavior change, and (2) this will require making ExtendedOrdering return java.lang.Boolean instead of Boolean (and the analogous change for the C++ and CodeOrdering comparisons).

NA == NA sound[s] perfectly correct

This is not the path we’ve taken so far. We think of NA not as a value, but as a value we don’t know. So comparing two unknown values shouldn’t be True, it should also be unknown. While this diverges from the Python’s behavior with respect to None (and one might argue our NA shouldn’t be converted into Python None), it does match SQL and R, for example:

$ R
> NA == NA
[1] NA

and the join behavior you’re seeing is the standard SQL join behavior.


#6

One more comment:

In some other operations that use table keys, NA == NA => True and NaN == NaN => True

How did this come about? There are two sets of comparisons: the normal <, <=, etc. that are not a total comparison (e.g. NaN compares false with everything) and a total comparison on every type (implemented via compare: Int). To reliably order tables, we need to use the total comparison. That then got used for various operations like distinct.

I’d think this is a bug, but surprisingly, SQL SELECT DISTINCT matches our current behavior (NAs are equal).