I have a pandas DataFrame with several flag/dummy variables of type Int64.
I am aggregating on other fields and taking the mean value in order to calculate a percent.
df.groupby(["key1", "key2"]).mean()
When I try to take the mean, I get the TypeError: cannot safely cast non-equivalent float64 to int64.
When I try to take the mean of each column one-by-one, I don't receive the error.
I am trying to understand what could cause the error. Any insight would be greatly appreciated.
Here is a description of the data:
In:
df.info()
Out:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6910491 entries, 82222 to 6858085
Data columns (total 5 columns):
# Column Dtype
--- ------ -----
0 key1 object
1 key2 object
2 cond1 int64
3 cond2 Int64
4 cond1and2 Int64
dtypes: Int64(2), int64(1), object(2)
memory usage: 329.5+ MB
In:
df.describe()
Out:
cond1 cond2 cond1and2
count 6.910491e+06 6.910491e+06 6.910491e+06
mean 2.004735e-02 1.050030e-01 6.695038e-03
std 1.401622e-01 3.065573e-01 8.154885e-02
min 0.000000e+00 0.000000e+00 0.000000e+00
25% 0.000000e+00 0.000000e+00 0.000000e+00
50% 0.000000e+00 0.000000e+00 0.000000e+00
75% 0.000000e+00 0.000000e+00 0.000000e+00
max 1.000000e+00 1.000000e+00 1.000000e+00
In:
[print(df[c].value_counts(), "\n\n") for c in df]
Out:
c 2220221
d 2208322
b 2195117
a 286831
Name: key1, dtype: int64
1 1925173
4 1680848
3 1656101
2 1648369
Name: key2, dtype: int64
0 6771954
1 138537
Name: cond1, dtype: int64
0 6184869
1 725622
Name: cond2, dtype: Int64
0 6864225
1 46266
Name: cond1and2, dtype: Int64
[None, None, None, None, None]
In:
df.groupby(['key1', 'key2']).mean()
Out:
TypeError Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\arrays\integer.py in safe_cast(values, dtype, copy)
143 try:
--> 144 return values.astype(dtype, casting="safe", copy=copy)
145 except TypeError:
TypeError: Cannot cast array from dtype('float64') to dtype('int64') according to the rule 'safe'
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-70-5cec730bfc37> in <module>
----> 1 df.groupby(['key1', 'key2']).mean()
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\groupby\groupby.py in mean(self, *args, **kwargs)
1230 nv.validate_groupby_func("mean", args, kwargs, ["numeric_only"])
1231 return self._cython_agg_general(
-> 1232 "mean", alt=lambda x, axis: Series(x).mean(**kwargs), **kwargs
1233 )
1234
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\groupby\generic.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
1002 ) -> DataFrame:
1003 agg_blocks, agg_items = self._cython_agg_blocks(
-> 1004 how, alt=alt, numeric_only=numeric_only, min_count=min_count
1005 )
1006 return self._wrap_agged_blocks(agg_blocks, items=agg_items)
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\groupby\generic.py in _cython_agg_blocks(self, how, alt, numeric_only, min_count)
1091 # Cast back if feasible
1092 result = type(block.values)._from_sequence(
-> 1093 result.ravel(), dtype=block.values.dtype
1094 )
1095 except ValueError:
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\arrays\integer.py in _from_sequence(cls, scalars, dtype, copy)
348 @classmethod
349 def _from_sequence(cls, scalars, dtype=None, copy=False):
--> 350 return integer_array(scalars, dtype=dtype, copy=copy)
351
352 @classmethod
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\arrays\integer.py in integer_array(values, dtype, copy)
129 TypeError if incompatible types
130 """
--> 131 values, mask = coerce_to_array(values, dtype=dtype, copy=copy)
132 return IntegerArray(values, mask)
133
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\arrays\integer.py in coerce_to_array(values, dtype, mask, copy)
245 values = safe_cast(values, dtype, copy=False)
246 else:
--> 247 values = safe_cast(values, dtype, copy=False)
248
249 return values, mask
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\arrays\integer.py in safe_cast(values, dtype, copy)
150
151 raise TypeError(
--> 152 f"cannot safely cast non-equivalent {values.dtype} to {np.dtype(dtype)}"
153 )
154
TypeError: cannot safely cast non-equivalent float64 to int64
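Int64 (pandas' nullable integer array dtype) is not the same as int64 (the plain NumPy integer dtype). (Read more about that here and here). The traceback shows what goes wrong: after computing the float64 means, groupby tries to cast the result back to the column's original Int64 dtype ("Cast back if feasible"), and that cast fails because the means are not whole numbers.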
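To make the distinction concrete, here is a minimal sketch with made-up values (the variable names are only for illustration). It shows that the two dtypes are reported differently and that only the capital-I version can hold a missing value while staying an integer column.

```python
import pandas as pd

# Nullable extension dtype (capital "I"): integers plus pd.NA for missing values.
nullable = pd.Series([0, 1, None], dtype="Int64")

# Plain NumPy dtype (lowercase "i"): integers only, no missing-value support.
plain = pd.Series([0, 1, 1], dtype="int64")

print(nullable.dtype, plain.dtype)

# The None above was stored as a missing value, not as a number.
n_missing = int(nullable.isna().sum())
print(n_missing)
```

Checking df.dtypes (or df.info(), as in the question) is the quickest way to see which columns are Int64 and which are int64.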
In order to solve that, change the datatype of those columns with
df[['cond2', 'cond1and2']] = df[['cond2', 'cond1and2']].astype('int64')
or
import numpy as np
df[['cond2', 'cond1and2']] = df[['cond2', 'cond1and2']].astype(np.int64)
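As a rough end-to-end sketch, the conversion followed by the grouped mean could look like this. The four-row frame is invented for illustration and only mimics the column layout from the question; the dictionary form of astype is used here as an equivalent way to cast just the selected columns.

```python
import pandas as pd

# Toy data with the same column names as in the question (values are made up).
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b"],
    "key2": [1, 1, 2, 2],
    "cond1": [0, 1, 0, 0],
    "cond2": [1, 1, 0, 1],
    "cond1and2": [0, 1, 0, 0],
})

# Reproduce the dtypes from the question: two nullable Int64 flag columns.
flag_cols = ["cond2", "cond1and2"]
df = df.astype({col: "Int64" for col in flag_cols})

# Cast the nullable columns to plain NumPy int64 before aggregating.
df = df.astype({col: "int64" for col in flag_cols})

# The mean of a 0/1 flag per group is the share of rows where the flag is set.
result = df.groupby(["key1", "key2"]).mean()
print(result)
```

After the cast, all three flag columns are int64 and the grouped means come back as ordinary float64 columns.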
Note: If the columns contain missing values (the count row of df.describe() may help detect them), they must be handled before the cast, because a NumPy int64 column cannot hold them. There are various ways to do that, such as removing the rows with missing values or filling the missing cells (in my answer here one will see a way to find and handle missing values).
Missing values are frequently indicated by out-of-range entries; perhaps a negative number (e.g., -1) in a numeric field that is normally only positive, or a 0 in a numeric field that can never normally be 0. (Witten, I. H. (2016). Data Mining: Practical Machine Learning Tools and Techniques)
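The two options mentioned in the note (dropping rows or filling cells) might be sketched as follows. The data are invented, and filling with 0 is only an assumption for the example; whether 0 is a sensible replacement depends on what the flag means.

```python
import pandas as pd

# Toy frame with one missing flag value (made-up data).
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b"],
    "cond2": pd.array([1, None, 0, 1], dtype="Int64"),
})

# Count missing values per column to see whether any handling is needed.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Option 1: drop the rows where the flag is missing, then cast.
dropped = df.dropna(subset=["cond2"]).astype({"cond2": "int64"})

# Option 2: fill the missing flag (0 is an assumption here), then cast.
filled = df.fillna({"cond2": 0}).astype({"cond2": "int64"})
```

Note that the two options give different group means: dropping rows shrinks the denominator, while filling with 0 counts the missing rows as "flag not set".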
For more information on missing values: