I am writing a simple piece of code that down-samples a dataframe when the target variable has more than 2 classes.
Let df be our arbitrary dataset and 'TARGET_VAR' a categorical variable with more than 2 classes.
import pandas as pd
label='TARGET_VAR' #define the target variable
num_class=df[label].value_counts() #Series with the count of each class value
temp=pd.DataFrame() #create empty dataframe to be filled up
for cl in num_class.index: #loop through classes
    #iteratively downsample every class according to the smallest
    #class 'min(num_class)' and append it to the dataframe.
    #(pd.concat is used here because DataFrame.append was removed in pandas 2.0)
    temp=pd.concat([temp, df[df[label]==cl].sample(min(num_class))])
df=temp #redefine initial dataframe as the subsample one
del temp, num_class #delete temporary dataframe
Is there a more refined way to do this, e.g. without having to create the temporary dataframe? I tried to figure out a way to "vectorize" the operation for multiple classes but didn't get anywhere. Below is my idea, which is easy to implement for 2 classes, but I have no idea how to extend it to the multi-class case.
This works perfectly if you have 2 classes:
df= pd.concat([df[df[label]==num_class.idxmin()],
               df[df[label]!=num_class.idxmin()].sample(min(num_class))])
This picks the right total number of observations from the other classes, but those classes will not necessarily be equally represented:
df1= pd.concat([df[df[label]==num_class.idxmin()],
                df[df[label]!=num_class.idxmin()].sample(min(num_class)*(len(num_class)-1))])
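To make the problem with the pooled sample concrete, here is a minimal sketch on made-up data (the toy dataframe and its class sizes are assumptions for illustration only). The total number of rows comes out right, but the rows drawn from the non-minimal classes tend to follow their original proportions, so the largest class usually dominates.

```python
import pandas as pd

label = "TARGET_VAR"

# Toy data, invented for illustration: 2 rows of "a", 10 of "b", 100 of "c".
toy = pd.DataFrame({label: ["a"] * 2 + ["b"] * 10 + ["c"] * 100})

num_class = toy[label].value_counts()  # count of each class
rare = num_class.idxmin()              # name of the smallest class
n_min = num_class.min()                # size of the smallest class

# Draw one pooled sample from all rows that are not in the smallest class.
pooled = toy[toy[label] != rare].sample(n_min * (len(num_class) - 1), random_state=0)
result = pd.concat([toy[toy[label] == rare], pooled])

# The total size is n_min * number_of_classes, but the split between
# "b" and "c" is left to chance and is generally not n_min each.
print(result[label].value_counts())
```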
You could try something similar to this:
label='TARGET_VAR'
g = df.groupby(label, group_keys=False)
balanced_df = pd.DataFrame(g.apply(lambda x: x.sample(g.size().min()))).reset_index(drop=True)
I believe this will produce the result you want. Feel free to ask any further questions.
Fixed the code according to OP's suggestion.
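As an alternative to apply with a lambda, the groupby object has its own sample method in pandas 1.1 and later, which draws the same number of rows from every group. The sketch below assumes that version is available; the helper name downsample_to_smallest and the toy data are made up for illustration.

```python
import pandas as pd

def downsample_to_smallest(df, label, random_state=None):
    """Return a new dataframe in which every class of `label`
    has as many rows as the rarest class (hypothetical helper)."""
    n_min = df[label].value_counts().min()  # size of the smallest class
    return (
        df.groupby(label)
          .sample(n=n_min, random_state=random_state)  # n_min rows per class
          .reset_index(drop=True)
    )

# Toy data, invented for illustration: classes a, b, c with 5, 3 and 2 rows.
toy = pd.DataFrame({
    "TARGET_VAR": list("aaaaabbbcc"),
    "x": range(10),
})

balanced = downsample_to_smallest(toy, "TARGET_VAR", random_state=0)
print(balanced["TARGET_VAR"].value_counts())
```

Passing random_state makes the subsample reproducible. Note that rows whose target value is missing are dropped by groupby by default, so they will not appear in the result.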