Why do these Int64 columns raise a TypeError?

I have a pandas DataFrame with several flag/dummy variables of type Int64.

I am aggregating on other fields and taking the mean value in order to calculate a percent.

df.groupby(["key1", "key2"]).mean()

When I try to take the mean, I get the TypeError: cannot safely cast non-equivalent float64 to int64.

When I try to take the mean of each column one-by-one, I don't receive the error.

I am trying to understand what could cause the error. Any insight would be greatly appreciated.

Here is a description of the data:

In:

df.info()

Out:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6910491 entries, 82222 to 6858085
Data columns (total 5 columns):
 #   Column     Dtype 
---  ------     ----- 
 0   key1       object
 1   key2       object
 2   cond1      int64 
 3   cond2      Int64 
 4   cond1and2  Int64 
dtypes: Int64(2), int64(1), object(2)
memory usage: 329.5+ MB

In:

df.describe()

Out:


              cond1         cond2     cond1and2
count  6.910491e+06  6.910491e+06  6.910491e+06
mean   2.004735e-02  1.050030e-01  6.695038e-03
std    1.401622e-01  3.065573e-01  8.154885e-02
min    0.000000e+00  0.000000e+00  0.000000e+00
25%    0.000000e+00  0.000000e+00  0.000000e+00
50%    0.000000e+00  0.000000e+00  0.000000e+00
75%    0.000000e+00  0.000000e+00  0.000000e+00
max    1.000000e+00  1.000000e+00  1.000000e+00

In: 

[print(df[c].value_counts(), "\n\n") for c in df]

Out:

c    2220221
d    2208322
b    2195117
a     286831
Name: key1, dtype: int64 


1    1925173
4    1680848
3    1656101
2    1648369
Name: key2, dtype: int64 


0    6771954
1     138537
Name: cond1, dtype: int64 


0    6184869
1     725622
Name: cond2, dtype: Int64 


0    6864225
1      46266
Name: cond1and2, dtype: Int64 


[None, None, None, None, None]

In: 

df.groupby(['key1', 'key2']).mean()

Out:

TypeError                                 Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\arrays\integer.py in safe_cast(values, dtype, copy)
    143     try:
--> 144         return values.astype(dtype, casting="safe", copy=copy)
    145     except TypeError:

TypeError: Cannot cast array from dtype('float64') to dtype('int64') according to the rule 'safe'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-70-5cec730bfc37> in <module>
----> 1 df.groupby(['key1', 'key2']).mean()

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\groupby\groupby.py in mean(self, *args, **kwargs)
   1230         nv.validate_groupby_func("mean", args, kwargs, ["numeric_only"])
   1231         return self._cython_agg_general(
-> 1232             "mean", alt=lambda x, axis: Series(x).mean(**kwargs), **kwargs
   1233         )
   1234 

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\groupby\generic.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
   1002     ) -> DataFrame:
   1003         agg_blocks, agg_items = self._cython_agg_blocks(
-> 1004             how, alt=alt, numeric_only=numeric_only, min_count=min_count
   1005         )
   1006         return self._wrap_agged_blocks(agg_blocks, items=agg_items)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\groupby\generic.py in _cython_agg_blocks(self, how, alt, numeric_only, min_count)
   1091                         # Cast back if feasible
   1092                         result = type(block.values)._from_sequence(
-> 1093                             result.ravel(), dtype=block.values.dtype
   1094                         )
   1095                     except ValueError:

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\arrays\integer.py in _from_sequence(cls, scalars, dtype, copy)
    348     @classmethod
    349     def _from_sequence(cls, scalars, dtype=None, copy=False):
--> 350         return integer_array(scalars, dtype=dtype, copy=copy)
    351 
    352     @classmethod

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\arrays\integer.py in integer_array(values, dtype, copy)
    129     TypeError if incompatible types
    130     """
--> 131     values, mask = coerce_to_array(values, dtype=dtype, copy=copy)
    132     return IntegerArray(values, mask)
    133 

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\arrays\integer.py in coerce_to_array(values, dtype, mask, copy)
    245         values = safe_cast(values, dtype, copy=False)
    246     else:
--> 247         values = safe_cast(values, dtype, copy=False)
    248 
    249     return values, mask

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\arrays\integer.py in safe_cast(values, dtype, copy)
    150 
    151         raise TypeError(
--> 152             f"cannot safely cast non-equivalent {values.dtype} to {np.dtype(dtype)}"
    153         )
    154 

TypeError: cannot safely cast non-equivalent float64 to int64

Solution

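Int64 (pandas' nullable integer extension dtype, with a capital I) is not the same as int64 (the plain NumPy integer dtype). The traceback shows where this matters: groupby(...).mean() computes float64 means and then tries to cast them back to the column's original Int64 dtype (the "Cast back if feasible" step in _cython_agg_blocks). Because the means are not whole numbers, safe_cast refuses and raises the TypeError.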
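To make the distinction concrete, here is a minimal sketch on hypothetical toy data (the variable names are illustrative and not taken from the question). It only shows how the two dtypes differ in handling a missing entry; it does not reproduce the error.

```python
import pandas as pd

# Hypothetical toy data, not the asker's DataFrame.
nullable = pd.Series([0, 1, None], dtype="Int64")  # pandas extension dtype
plain = pd.Series([0, 1, 1], dtype="int64")        # NumPy dtype

print(nullable.dtype, plain.dtype)

# The nullable column keeps integers and stores the gap as pd.NA.
print(nullable.isna().sum())

# Without an explicit dtype, a missing entry pushes the data to float64,
# because NumPy's int64 has no way to represent a missing value.
inferred = pd.Series([0, 1, None])
print(inferred.dtype)
```

Checking df.dtypes (or df.info(), as in the question) is usually enough to tell which of the two a column uses, since the capitalisation is shown there.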

To avoid the error, convert those columns to the NumPy int64 dtype before aggregating:

df[['cond2', 'cond1and2']] = df[['cond2', 'cond1and2']].astype('int64')

or, equivalently, pass the NumPy type object:

import numpy as np

df[['cond2', 'cond1and2']] = df[['cond2', 'cond1and2']].astype(np.int64)
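As a rough end-to-end illustration of the conversion, the sketch below builds a small hypothetical frame with the same column names as the question, casts the Int64 flag columns to int64, and then takes the grouped mean. The data are invented, and the dict form of astype is just one way to write the cast.

```python
import pandas as pd

# Hypothetical toy frame that mirrors the question's column names.
df_toy = pd.DataFrame({
    "key1": ["a", "a", "b", "b"],
    "key2": [1, 1, 2, 2],
    "cond2": pd.array([0, 1, 1, 1], dtype="Int64"),
    "cond1and2": pd.array([0, 0, 0, 1], dtype="Int64"),
})

flag_cols = ["cond2", "cond1and2"]

# Cast the nullable columns to NumPy int64 (safe here: no missing values).
df_toy = df_toy.astype({col: "int64" for col in flag_cols})

# With plain int64 columns the grouped means come back as float64.
means = df_toy.groupby(["key1", "key2"]).mean()
print(means)
# (a, 1): cond2 = 0.5, cond1and2 = 0.0
# (b, 2): cond2 = 1.0, cond1and2 = 0.5
```

Whether the original error appears on a frame like this depends on the pandas version, so the sketch only demonstrates the workaround, not the failure.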

Note: The conversion to int64 only works if the columns contain no missing values, because int64 cannot represent them. If there are missing values (comparing the count row of df.describe() with the number of rows, or calling df.isna().sum(), helps detect them), handle them first, for example by removing the rows with missing values or by filling the missing cells.
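A minimal sketch of that check and of the two options mentioned, again on hypothetical data. Filling with 0 is only an assumption for illustration; whether 0 is a sensible replacement depends on what a missing flag means in your data.

```python
import pandas as pd

# Hypothetical flag column with one missing entry.
s = pd.Series([0, 1, None, 1], dtype="Int64", name="cond2")

# Count the missing entries before attempting the cast.
n_missing = int(s.isna().sum())
print(n_missing)

# A direct cast to int64 is expected to fail while missing values remain.
try:
    s.astype("int64")
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True
print(cast_failed)

# Option 1: fill the gaps (0 is an assumed default), then cast.
filled = s.fillna(0).astype("int64")

# Option 2: drop the rows with missing values, then cast.
dropped = s.dropna().astype("int64")

# Alternative: cast to float64, which keeps the gap as NaN;
# mean() skips NaN by default.
as_float = s.astype("float64")
print(as_float.mean())
```

Note that the options give different means: filling with 0 counts the missing row as a 0 flag, while dropping it or casting to float64 leaves it out of the denominator.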

Missing values are frequently indicated by out-of-range entries; perhaps a negative number (e.g., -1) in a numeric field that is normally only positive, or a 0 in a numeric field that can never normally be 0. (Witten, I. H. (2016). Data Mining: Practical Machine Learning Tools and Techniques)

For more information on missing values: