I am trying to filter out some rows in my dataframe (with > 400000 rows) where values in one column have the None type. The goal is to leave my dataframe with only rows that have values that are float in the 'Column' column. I plan on doing this by passing in an array of booleans, except that I can't construct my array of booleans properly (they all come back True).
When I run the following operation, given a value of i within the df range, the comparison works:
df.loc[i, 'Column'] != None
The rows that have a value of None in 'Column' give the result False.
But when I run this operation:
df.loc[0:len(df), 'Column'] != None
The boolean array comes back as all True.
Why is this? Is this a pandas bug? An edge case? Intended behaviour for reasons I don't understand?
I can think of other ways to construct my boolean array, though this seems the most efficient. But it bothers me that this is the result I am getting.
Here's a reproducible example of what you're seeing:
import pandas as pd

x = pd.Series([1, None, 3, None, None])
print(x != None)
0 True
1 True
2 True
3 True
4 True
dtype: bool
What's not obvious is that, behind the scenes, pandas converts your series to numeric and converts those None values to np.nan:
print(x)
0 1.0
1 NaN
2 3.0
3 NaN
4 NaN
dtype: float64
The NumPy array underlying the series can then be held in a contiguous memory block and supports vectorised operations. Since np.nan != np.nan evaluates to True by design, your Boolean series will contain only True values, even if you were to test against np.nan instead of None.
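For example (a quick check, re-using the x series from above and assuming NumPy is imported as np):
import numpy as np
print(np.nan != np.nan)
True
print(x != np.nan)
0 True
1 True
2 True
3 True
4 True
dtype: bool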
For efficiency and correctness, you should use pd.to_numeric with isnull / notnull for checking null values:
print(pd.to_numeric(x, errors='coerce').notnull())
0 True
1 False
2 True
3 False
4 False
dtype: bool
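Applied back to your original problem, a sketch of the filter might look like this (assuming your DataFrame is called df and the column of interest is 'Column', as in the question):
# keep only rows where 'Column' holds a numeric (non-null) value
mask = pd.to_numeric(df['Column'], errors='coerce').notnull()
df = df[mask]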