python pandas downsampling

Downsampling for more than 2 classes


I am writing a simple piece of code that down-samples a dataframe when the target variable has more than 2 classes.

Let df be our arbitrary dataset and 'TARGET_VAR' a categorical variable with more than 2 classes.

import pandas as pd
label = 'TARGET_VAR'  # define the target variable

num_class = df[label].value_counts()  # Series with the count of each class value
temp = []  # temporary list to collect the per-class samples

for cl in num_class.index:  # loop through classes
    # downsample every class to the size of the smallest class,
    # num_class.min(), and collect the sample
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used below)
    temp.append(df[df[label] == cl].sample(num_class.min()))

df = pd.concat(temp)  # redefine the initial dataframe as the subsampled one

del temp, num_class  # delete temporary objects

Now I was wondering: is there a more refined way to do this, e.g. without having to create the temporary object? I tried to figure out a way to "vectorize" the operation for multiple classes but didn't get anywhere. Below is my idea, which can easily be implemented for 2 classes, but I have no idea how to extend it to the multi-class case.

This works perfectly if you have 2 classes:

df = pd.concat([df[df[label] == num_class.idxmin()],
                df[df[label] != num_class.idxmin()].sample(num_class.min())])

This picks the right total number of observations from the other classes, but those classes will not necessarily be equally represented:

df1 = pd.concat([df[df[label] == num_class.idxmin()],
                 df[df[label] != num_class.idxmin()].sample(num_class.min() * (len(num_class) - 1))])
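
A quick way to see the imbalance is to run the second snippet on a toy dataframe and inspect the class counts. This is only a sketch: the dataframe, the column `x`, and the `random_state` value are made up for illustration.

```python
import pandas as pd

label = 'TARGET_VAR'

# Hypothetical toy data: class 'a' is the smallest (3 rows)
df = pd.DataFrame({
    label: ['a'] * 3 + ['b'] * 50 + ['c'] * 10,
    'x': range(63),
})

num_class = df[label].value_counts()
n_min = num_class.min()

# Keep the smallest class whole, then draw one pooled sample from all other rows
df1 = pd.concat([
    df[df[label] == num_class.idxmin()],
    df[df[label] != num_class.idxmin()].sample(n_min * (len(num_class) - 1),
                                               random_state=0),
])

# The total row count is as intended, but 'b' and 'c' are drawn from one pool,
# so nothing forces each of them to contribute exactly n_min rows
print(df1[label].value_counts())
```

The pooled sample tends to favour the larger classes, which is why a per-class draw is needed for an exactly balanced result.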

Solution

  • You could try something similar to this:

    label = 'TARGET_VAR'

    g = df.groupby(label, group_keys=False)
    n_min = g.size().min()  # size of the smallest class
    # sample n_min rows from every class, then renumber the rows
    balanced_df = g.apply(lambda x: x.sample(n_min)).reset_index(drop=True)
    

    I believe this will produce the result you want. Feel free to ask any further questions.

    Edit

    Fixed the code according to OP's suggestion.
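
For reference, pandas 1.1 and later also provide `DataFrameGroupBy.sample`, which draws the same number of rows from every group without going through `apply`. A minimal sketch of that alternative, assuming a toy dataframe (its contents, the column `x`, and the `random_state` value are invented for illustration):

```python
import pandas as pd

label = 'TARGET_VAR'

# Hypothetical toy data with three unbalanced classes
df = pd.DataFrame({
    label: ['a'] * 3 + ['b'] * 50 + ['c'] * 10,
    'x': range(63),
})

# Size of the smallest class
n_min = df[label].value_counts().min()

# Draw n_min rows from every class; random_state is optional and only
# makes the sketch reproducible
balanced_df = (
    df.groupby(label)
      .sample(n=n_min, random_state=0)
      .reset_index(drop=True)
)

print(balanced_df[label].value_counts())
```

Because each group is sampled without replacement by default, `n` cannot exceed the size of the smallest class, which is why `n_min` is used here.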