Search code examples
pythonpandascategorical-databinning

Pandas reduce number of categorical variables in value_counts() tabulation


New to pandas I want to perform something similar to Reduce number of levels for large categorical variables (binning of categorical variables in order to reduce their levels) The following code works fine in R

DTsetlvls <- function(x, newl)  
   setattr(x, "levels", c(setdiff(levels(x), newl), rep("other", length(newl))))

My dataframe:

df = pd.DataFrame({'Color': 'Red Red Blue'.split(),
                   'Value': [100, 150, 50]})

df['Counts'] = df.groupby('Color')['Value'].transform('count')
print (df)

  Color  Value  Counts
0   Red    100       2
1   Red    150       2
2  Blue     50       1

I manually would create an aggregate column and then based on that, label the less frequent groups e.g. "blue" as a single "other" group. But compared to the concise R code this seems clumsy. What would be the right approach here?


Solution

  • You can use value_counts with numpy.where, where is condition with isin.

    If your variable is of type object see below. If your variable is of type category, then skip down toward the bottom.

    df = pd.DataFrame({'Color':'Red Red Blue Red Violet Blue'.split(), 
                       'Value':[11,150,50,30,10,40]})
    print (df)
        Color  Value
    0     Red     11
    1     Red    150
    2    Blue     50
    3     Red     30
    4  Violet     10
    5    Blue     40
    
    a = df.Color.value_counts()
    print (a)
    Red       3
    Blue      2
    Violet    1
    Name: Color, dtype: int64
    
    #get top 2 values of index
    vals = a[:2].index
    print (vals)
    Index(['Red', 'Blue'], dtype='object')
    

    df['new'] = np.where(df.Color.isin(vals), 0,1)
    print (df)
        Color  Value  new
    0     Red     11    0
    1     Red    150    0
    2    Blue     50    0
    3     Red     30    0
    4  Violet     10    1
    5    Blue     40    0
    

    Or if need replace all not top values use where:

    df['new1'] = df.Color.where(df.Color.isin(vals), 'other')
    print (df)
        Color  Value   new1
    0     Red     11    Red
    1     Red    150    Red
    2    Blue     50   Blue
    3     Red     30    Red
    4  Violet     10  other
    5    Blue     40   Blue
    

    For category type:

    df = pd.DataFrame({'Color':'Red Red Blue Red Violet Blue'.split(), 
                       'Value':[11,150,50,30,10,40]})
    df.Color = df.Color.astype('category')
    
    a= df.Color.value_counts()[:2].index
    print(a)
    CategoricalIndex(['Red', 'Blue'], 
                    categories=['Blue', 'Red', 'Violet'], 
                    ordered=False, dtype='category')
    

    Notice that violet is still a category. So we need .remove_unused_categories().

    vals = df.Color.value_counts()[:2].index.remove_unused_categories()
    CategoricalIndex(['Red', 'Blue'], 
                     categories=['Blue', 'Red'], 
                     ordered=False, dtype='category')
    

    As mentioned in the comments, a ValueError will occur when setting the new variable. The way around that is type changing.

    df['new1'] = df.Color.astype('object').where(df.Color.isin(vals), 'other')
    df['new1'] = df['new1'].astype('category')