New to pandas I want to perform something similar to Reduce number of levels for large categorical variables (binning of categorical variables in order to reduce their levels) The following code works fine in R
DTsetlvls <- function(x, newl)
setattr(x, "levels", c(setdiff(levels(x), newl), rep("other", length(newl))))
My dataframe:
df = pd.DataFrame({'Color': 'Red Red Blue'.split(),
'Value': [100, 150, 50]})
df['Counts'] = df.groupby('Color')['Value'].transform('count')
print (df)
Color Value Counts
0 Red 100 2
1 Red 150 2
2 Blue 50 1
I manually would create an aggregate column and then based on that, label the less frequent groups e.g. "blue" as a single "other" group. But compared to the concise R code this seems clumsy. What would be the right approach here?
You can use value_counts
with numpy.where
, where is condition with isin
.
If your variable is of type object see below. If your variable is of type category, then skip down toward the bottom.
df = pd.DataFrame({'Color':'Red Red Blue Red Violet Blue'.split(),
'Value':[11,150,50,30,10,40]})
print (df)
Color Value
0 Red 11
1 Red 150
2 Blue 50
3 Red 30
4 Violet 10
5 Blue 40
a = df.Color.value_counts()
print (a)
Red 3
Blue 2
Violet 1
Name: Color, dtype: int64
#get top 2 values of index
vals = a[:2].index
print (vals)
Index(['Red', 'Blue'], dtype='object')
df['new'] = np.where(df.Color.isin(vals), 0,1)
print (df)
Color Value new
0 Red 11 0
1 Red 150 0
2 Blue 50 0
3 Red 30 0
4 Violet 10 1
5 Blue 40 0
Or if need replace all not top values use where
:
df['new1'] = df.Color.where(df.Color.isin(vals), 'other')
print (df)
Color Value new1
0 Red 11 Red
1 Red 150 Red
2 Blue 50 Blue
3 Red 30 Red
4 Violet 10 other
5 Blue 40 Blue
For category type:
df = pd.DataFrame({'Color':'Red Red Blue Red Violet Blue'.split(),
'Value':[11,150,50,30,10,40]})
df.Color = df.Color.astype('category')
a= df.Color.value_counts()[:2].index
print(a)
CategoricalIndex(['Red', 'Blue'],
categories=['Blue', 'Red', 'Violet'],
ordered=False, dtype='category')
Notice that violet is still a category. So we need .remove_unused_categories()
.
vals = df.Color.value_counts()[:2].index.remove_unused_categories()
CategoricalIndex(['Red', 'Blue'],
categories=['Blue', 'Red'],
ordered=False, dtype='category')
As mentioned in the comments, a ValueError will occur when setting the new variable. The way around that is type changing.
df['new1'] = df.Color.astype('object').where(df.Color.isin(vals), 'other')
df['new1'] = df['new1'].astype('category')