python pandas dataframe bitwise-operators

Bitwise comparison of "slightly" different DataFrames yields conflicting results


While working on a problem involving the bitwise AND operator, I ran into the behavior shown below.

When I take a Series from each of two pandas DataFrames and perform the same conditional check, the results differ.

  1. What is happening under the hood in inputs In [95] and In [96]?
  2. Why do the outcomes differ for the two DataFrames?

In [91]: df = pd.DataFrame({"h": [5300, 5420, 5490], "l": [5150, 5270, 5270]})

In [92]: df
Out[92]: 
      h     l
0  5300  5150
1  5420  5270
2  5490  5270

In [93]: df2 = pd.DataFrame({"h": [5300.1, 5420.1, 5490.1], "l": [5150.1, 5270.1, 5270.1]})

In [94]: df2
Out[94]: 
        h       l
0  5300.1  5150.1
1  5420.1  5270.1
2  5490.1  5270.1

In [95]: df["h"].notna() & df["l"]
Out[95]: 
0    False
1    False
2    False
dtype: bool

In [96]: df2["h"].notna() & df2["l"]
Out[96]: 
0    True
1    True
2    True
dtype: bool


Solution

  • You've hit some weird implicit casting. I believe what you mean is:

    df["h"].notna() & df["l"].notna()
    

    or perhaps

    df["h"].notna() & df["l"].astype(bool)
    

    In the original,

    df["h"].notna() & df["l"]
    

    you have requested a bitwise operation on two Series, the first of which is dtyped as boolean and the second of which is either integer (in df) or float (in df2).

    In the first case, a boolean can be upcast to an int. It appears that the boolean True is upcast to the integer 1 (binary 0000000001) and then bitwise-ANDed with the integers 5150, 5270, and 5270. Each result is 0, since all of those numbers are even and therefore have a lowest bit of 0, and 0 is reported as False in the bool result. For example, if you set

    df.loc[2, 'l'] = 5271
    

    you will see that the final value changes to True.

    In the case of df2, a float and a bool cannot be bitwise-ANDed together, because the bitwise operators are not defined for floats. It appears that pandas may be implicitly converting the float array to bool here, so every nonzero value becomes True. numpy itself would not do this:

    In [79]: np.float64([.1, .2]) & np.array([True, True])
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-79-2c2e50f0bf99> in <module>
    ----> 1 np.float64([.1, .2]) & np.array([True, True])
    
    TypeError: ufunc 'bitwise_and' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
    

    But pandas seems to allow it:

    In [88]: pd.Series([True, True, True]) & pd.Series([0, .1, .2])
    Out[88]:
    0    False
    1     True
    2     True
    dtype: bool
    

    The same result can be achieved in numpy by calling astype(bool) explicitly:

    In [92]: np.array([True, True, True]) & np.float64([0, .1, .2]).astype(bool)
    Out[92]: array([False,  True,  True])