python pandas boolean-operations

Pandas boolean operations are inconsistent with one comparison vs. many comparisons


I am trying to filter out some rows in my dataframe (with > 400000 rows) where values in one column have the None type. The goal is to leave my dataframe with only rows that have values that are float in the 'Column' column. I plan on doing this by passing in an array of booleans, except that I can't construct my array of booleans properly (they all come back True).

When I run the following operation, given a value of i within the df range, the comparison works:

df.loc[i, 'Column'] != None 

For rows that have a value of None in 'Column', this gives False, as expected.

But when I run this operation:

df.loc[0:len(df), 'Column'] != None 

The boolean array comes back as all True.

Why is this? Is this a pandas bug? An edge case? Intended behaviour for reasons I don't understand?

I can think of other ways to construct my boolean array, though this seems the most efficient. But it bothers me that this is the result I am getting.


Solution

  • Here's a reproducible example of what you're seeing:

    import pandas as pd
    
    x = pd.Series([1, None, 3, None, None])
    
    print(x != None)
    
    0    True
    1    True
    2    True
    3    True
    4    True
    dtype: bool
    

    What's not obvious is that, behind the scenes, pandas converts the series to a numeric dtype and replaces the None values with np.nan:

    print(x)
    
    0    1.0
    1    NaN
    2    3.0
    3    NaN
    4    NaN
    dtype: float64
    

    The NumPy array underlying the series can then be held in a contiguous memory block and support vectorised operations. NaN never compares equal to anything, including itself, so np.nan != np.nan is True by design. That is why your Boolean series contains only True values, and it would do so even if you tested against np.nan instead of None.

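    For efficiency and correctness, check for null values with isnull / notnull (also available as isna / notna) rather than with a comparison. If the column might also contain non-numeric values, pass it through pd.to_numeric with errors='coerce' first so that those become NaN as well: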

    print(pd.to_numeric(x, errors='coerce').notnull())
    
    0     True
    1    False
    2     True
    3    False
    4    False
    dtype: bool
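
To tie this back to the original goal of filtering the dataframe, here is a minimal sketch. The small `df` below is a made-up stand-in for the asker's data (only the 'Column' name comes from the question; the 'Other' column and all values are assumptions), so treat it as an illustration rather than the exact fix for your dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's dataframe; only the 'Column'
# name comes from the question, the rest is made up for illustration.
df = pd.DataFrame({"Column": [1.5, None, 3.0, None],
                   "Other": ["a", "b", "c", "d"]})

# NaN never compares equal to anything, including itself.
print(np.nan != np.nan)  # True

# So the element-wise comparison from the question is True on every row.
always_true = df["Column"] != None  # noqa: E711

# notna() tests for missing values directly and gives a usable mask.
mask = df["Column"].notna()

# Keep only the rows where 'Column' holds an actual value.
filtered = df[mask]
print(filtered)
```

`df.dropna(subset=["Column"])` should give the same rows in one call, if you don't need the mask for anything else.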