python pandas-groupby aggregate categorical-data

Pandas' groupby doesn't process categorical columns in agg function


I have a DataFrame with 3 columns, two holding categorical data and one float16. When I do a groupby and pass agg a lambda that processes each column differently according to its dtype, the categorical column is dropped.

Done this way, with column C left as a plain object column, it works.

import numpy as np
import pandas as pd

i = pd.DataFrame({"A": ["a", "a", "a", "b", "c", "c"], "B": [1, 2, 3, 4, 5, 6], "C": ["NaN", "b", "NaN", "b", "c", "c"]})
i['A'] = i['A'].astype('category')
i['B'] = i['B'].astype('float16')
i.groupby("A", as_index=False)[["B", "C"]].agg(lambda x: x.mean() if np.dtype(x) == 'float16' else x.value_counts().index[0])

The output, which is what I want, is:

    A   B   C
0   a   2.0 NaN
1   b   4.0 b
2   c   5.5 c

However, whenever I declare column C to be categorical, pandas drops column C from the result.

i = pd.DataFrame({"A": ["a", "a", "a", "b", "c", "c"], "B": [1, 2, 3, 4, 5, 6], "C": ["NaN", "b", "NaN", "b", "c", "c"]})
i['A'] = i['A'].astype('category')
i['B'] = i['B'].astype('float16')
i['C'] = i['C'].astype('category')
i.groupby("A", as_index=False)[["B", "C"]].agg(lambda x: x.mean() if np.dtype(x) == 'float16' else x.value_counts().index[0])

The result is the following warning, and an output without column C:

['C'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.

    A   B
0   a   2.0
1   b   4.0
2   c   5.5

Does anyone know whether agg on a groupby is unable to handle categorical columns?


Solution

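  • category is a pandas data type, not a NumPy one, and NumPy doesn't necessarily play well with it: np.dtype(i.C) raises an error, so the lambda fails on column C and agg drops that column with the warning above. Check the Series' own dtype attribute (pandas.Series.dtype) instead and it should work as expected.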

    foo = lambda x: x.mean() if x.dtype =='float16' else x.value_counts().index[0]
    i.groupby("A", as_index=False)[["B","C"]].agg(foo)
    #    A    B    C
    # 0  a  2.0  NaN
    # 1  b  4.0    b
    # 2  c  5.5    c
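
A slightly more defensive variant of the same idea, as a sketch only: instead of comparing against the literal 'float16', it asks pandas whether the column is numeric via pandas.api.types.is_numeric_dtype, so other float or integer widths would also take the mean branch. The helper name mean_or_most_frequent and the observed=True argument are my own additions, not part of the original answer. The data is the frame from the question.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Same data as in the question, with both A and C declared categorical.
df = pd.DataFrame({
    "A": ["a", "a", "a", "b", "c", "c"],
    "B": [1, 2, 3, 4, 5, 6],
    "C": ["NaN", "b", "NaN", "b", "c", "c"],
})
df["A"] = df["A"].astype("category")
df["B"] = df["B"].astype("float16")
df["C"] = df["C"].astype("category")


def mean_or_most_frequent(col: pd.Series):
    """Hypothetical helper: mean for numeric columns, most frequent value otherwise."""
    # col.dtype is the pandas-level dtype, so it works for category columns too,
    # unlike np.dtype(col), which only understands NumPy dtypes.
    if is_numeric_dtype(col.dtype):
        return col.mean()
    # value_counts() sorts by count in descending order, so index[0] is the mode.
    return col.value_counts().index[0]


# observed=True is an assumption on my part: it restricts the groups to the
# categories of A that actually occur, which is all of them in this frame.
result = df.groupby("A", as_index=False, observed=True)[["B", "C"]].agg(
    mean_or_most_frequent
)
print(result)
```

Note that "NaN" in column C is the literal string from the question's data, not a missing value, which is why it can come out as the most frequent value for group a.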