My question is about the rule pandas uses when comparing a column of dtype "object" with an integer. Here is my code:
In [334]: df
Out[334]:
     c1    c2        c3  c4
id1   1    li -0.367860   5
id2   2  zhao -0.596926   5
id3   3   sun  0.493806   5
id4   4  wang -0.311407   5
id5   5  wang  0.253646   5
In [335]: df < 2
Out[335]:
        c1    c2    c3     c4
id1   True  True  True  False
id2  False  True  True  False
id3  False  True  True  False
id4  False  True  True  False
id5  False  True  True  False
In [336]: df.dtypes
Out[336]:
c1      int64
c2     object
c3    float64
c4      int64
dtype: object
Why does the "c2" column get True for all rows?
P.S. I also tried:
In [333]: np.less(np.array(["s","b"]),2)
Out[333]: NotImplemented
For DataFrames, comparison with a scalar always returns a DataFrame having all Boolean columns.
I don't think it's documented anywhere officially, but there's a comment in the source code (see below) confirming the intended behaviour:
[for] straight boolean comparisons [between a DataFrame and a scalar] we want to allow all columns (regardless of dtype to pass thru) See #4537 for discussion.
In practice, this means that every comparison in every column must return either True or False. Any invalid comparison (such as 'li' < 2) should default to one of these Boolean values.
Put simply, the pandas developers decided that it should default to True.
There's some discussion of this behaviour in #4537, including arguments for using False instead or for restricting the comparison to columns with compatible types, but the ticket was closed and no code was changed.
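If the all-True result for a string column is not what you want, one way to get the "compatible types only" behaviour yourself is to select the numeric columns before comparing. This is only a sketch of a workaround, not what pandas does internally; the small frame below is an abbreviated stand-in for the one in the question.

```python
import pandas as pd

# Abbreviated stand-in for the frame in the question.
df = pd.DataFrame(
    {
        "c1": [1, 2, 3],
        "c2": ["li", "zhao", "sun"],
        "c3": [-0.367860, -0.596926, 0.493806],
    }
)

# Keep only numeric columns, then compare; the string column "c2" is dropped
# instead of being reported as True.
numeric_only = df.select_dtypes(include="number") < 2
print(numeric_only)
```

The result contains only c1 and c3, so no value is invented for the string column.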
If you're interested, you can see where the default value is used for invalid comparisons in an internal method found in ops.py:
def _comp_method_FRAME(cls, func, special):
    str_rep = _get_opstr(func, cls)
    op_name = _get_op_name(func, special)

    @Appender('Wrapper for comparison method {name}'.format(name=op_name))
    def f(self, other):
        if isinstance(other, ABCDataFrame):
            # Another DataFrame
            if not self._indexed_same(other):
                raise ValueError('Can only compare identically-labeled '
                                 'DataFrame objects')
            return self._compare_frame(other, func, str_rep)

        elif isinstance(other, ABCSeries):
            return _combine_series_frame(self, other, func,
                                         fill_value=None, axis=None,
                                         level=None, try_cast=False)
        else:

            # straight boolean comparisons we want to allow all columns
            # (regardless of dtype to pass thru) See #4537 for discussion.
            res = self._combine_const(other, func,
                                      errors='ignore',
                                      try_cast=False)
            return res.fillna(True).astype(bool)

    f.__name__ = op_name

    return f
The else block is the one we're interested in for the scalar case.
Note the errors='ignore' argument, meaning an invalid comparison will return NaN (instead of raising an error). The res.fillna(True) call fills these failed comparisons with True.
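To make the mechanism concrete, here is a minimal sketch that emulates the else branch by hand. The helper name compare_scalar_fill_true is hypothetical (it is not part of pandas), and it assumes that an invalid column comparison raises TypeError; the built-in df < 2 may behave differently depending on your pandas version, which is why the steps are spelled out explicitly here.

```python
import operator

import numpy as np
import pandas as pd


def compare_scalar_fill_true(df, scalar, func):
    """Hypothetical helper: compare each column with a scalar, treating
    invalid comparisons as NaN and then filling them with True."""
    out = {}
    for name, col in df.items():
        try:
            out[name] = func(col, scalar)
        except TypeError:
            # Invalid comparison (e.g. 'li' < 2): record NaN instead of
            # raising, mirroring the effect of errors='ignore'.
            out[name] = pd.Series(np.nan, index=df.index)
    res = pd.DataFrame(out, index=df.index)
    # Same final step as in the pandas source quoted above.
    return res.fillna(True).astype(bool)


# Abbreviated stand-in for the frame in the question; "c2" is forced to
# object dtype to match the dtypes shown there.
df = pd.DataFrame(
    {
        "c1": [1, 2, 3],
        "c2": ["li", "zhao", "sun"],
        "c3": [-0.367860, -0.596926, 0.493806],
    }
).astype({"c2": object})

result = compare_scalar_fill_true(df, 2, operator.lt)
print(result)
```

The c2 column comes out True in every row purely because of the fillna(True) step, not because any string was actually found to be less than 2.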