I have a DataFrame with three columns: two hold categorical data and one is float16. When I do a groupby and pass a lambda to agg that processes each column differently depending on its dtype, the categorical column is dropped from the result.
Done this way, it works:
import numpy as np
import pandas as pd
i=pd.DataFrame({"A":["a","a","a","b","c","c"],"B":[1,2,3,4,5,6],"C":["NaN","b","NaN","b","c","c"]})
i['A'] = i['A'].astype('category')
i['B'] = i['B'].astype('float16')
i.groupby("A", as_index=False)[["B","C"]].agg(lambda x: x.mean() if np.dtype(x)=='float16' else x.value_counts().index[0])
The output, which is what I want, is:
   A    B    C
0  a  2.0  NaN
1  b  4.0    b
2  c  5.5    c
However, as soon as I declare column C as categorical, pandas drops column C from the result.
i=pd.DataFrame({"A":["a","a","a","b","c","c"],"B":[1,2,3,4,5,6],"C":[ "NaN" ,"b","NaN","b","c","c"]})
i['A'] = i['A'].astype('category')
i['B'] = i['B'].astype('float16')
i['C'] = i['C'].astype('category')
i.groupby("A", as_index=False)[["B","C"]].agg(lambda x: x.mean() if np.dtype(x)=='float16' else x.value_counts().index[0])
The result is a warning followed by output without column C:
['C'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.
   A    B
0  a  2.0
1  b  4.0
2  c  5.5
Does anyone know whether agg on a groupby is unable to handle categorical columns?
category is a pandas data type, and numpy doesn't necessarily play well with it (np.dtype(i.C) gives an error). That error is raised inside your lambda for column C, which is why C fails to aggregate. Use pandas.Series.dtype instead and it should work as expected.
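To see the difference in isolation, here is a small sketch that compares the two checks on a float16 Series and a categorical Series. The names s_float, s_cat and numpy_accepts_categorical are made up for illustration, and the exact exception type may depend on your numpy version (TypeError is what I would expect).

```python
import numpy as np
import pandas as pd

# Two hypothetical Series, one per dtype used in the question.
s_float = pd.Series([1, 2, 3], dtype="float16")
s_cat = pd.Series(["a", "b", "a"], dtype="category")

# np.dtype() copes with the float16 Series because its .dtype is a numpy dtype.
numpy_float_check = np.dtype(s_float) == "float16"

# For the categorical Series numpy has no matching dtype, so the call raises.
try:
    np.dtype(s_cat)
    numpy_accepts_categorical = True
except TypeError:
    numpy_accepts_categorical = False

# Series.dtype can be compared against a string for both kinds of column.
pandas_float_check = s_float.dtype == "float16"
pandas_cat_is_float = s_cat.dtype == "float16"
pandas_cat_check = s_cat.dtype == "category"

print(numpy_float_check, numpy_accepts_categorical)
print(pandas_float_check, pandas_cat_is_float, pandas_cat_check)
```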
foo = lambda x: x.mean() if x.dtype =='float16' else x.value_counts().index[0]
i.groupby("A", as_index=False)[["B","C"]].agg(foo)
#    A    B    C
# 0  a  2.0  NaN
# 1  b  4.0    b
# 2  c  5.5    c
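If you would rather not hard-code the 'float16' string, a variant (just a sketch, not the only way to do it) is to test the dtype with pandas.api.types.is_float_dtype. The names df, mean_or_most_frequent and result are my own, and observed=True is only passed because the grouping column is categorical; every category appears in this data, so it should not change the groups.

```python
import pandas as pd
from pandas.api.types import is_float_dtype

# Same data as in the question, with C declared as categorical.
df = pd.DataFrame({
    "A": ["a", "a", "a", "b", "c", "c"],
    "B": [1, 2, 3, 4, 5, 6],
    "C": ["NaN", "b", "NaN", "b", "c", "c"],  # "NaN" is a plain string here
})
df["A"] = df["A"].astype("category")
df["B"] = df["B"].astype("float16")
df["C"] = df["C"].astype("category")


def mean_or_most_frequent(col):
    """Mean for float columns, most frequent value for everything else."""
    if is_float_dtype(col.dtype):
        return col.mean()
    # value_counts() sorts by count in descending order,
    # so the first index entry is the most frequent value.
    return col.value_counts().index[0]


result = (
    df.groupby("A", as_index=False, observed=True)[["B", "C"]]
      .agg(mean_or_most_frequent)
)
print(result)
```

A named function also makes it easier to add further dtype branches later than a one-line lambda would.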