
Dataset upsampling using pandas and sklearn - Python


I have a dataset in which one class is very imbalanced (190 records vs. 14810) based on the 'relevance' column. I tried to upsample it, which worked; but the issue is that I have another categorical column, 'class' (1000 records per class), and when I simply upsample based on the 'relevance' column, those classes become imbalanced. Is there a way to upsample 'relevance' while keeping the ratio of classes in the other column?

import pandas as pd
from sklearn.utils import resample

SEED = 42  # any fixed seed

# Create a dataset with the minority class
df_minority = df[df['relevance'] == 1]

# Create a dataset with the rest of the records
df_rest = df[df['relevance'] != 1]

# Upsample the minority class to match the majority count
df_1_upsampled = resample(df_minority, random_state=SEED, n_samples=14810, replace=True)

# Concatenate the upsampled minority with the rest
df_upsampled = pd.concat([df_1_upsampled, df_rest])

Sample dataset:

  relevance   class   2   3   4   5  
          1       A  40  24  11  50
          1       A  60  20  19  60
          0       C  15  57  15  60
          0       B  12  50  15  43 
          0       B  90   8  32  80
          0       C  74   8  21  34

So, the goal is to make the number of 'relevance' classes equal, keeping the 1:1:1 ratio of the 'class' category.
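To make the issue concrete, here is a minimal self-contained sketch (miniature, made-up data; all numbers are hypothetical) showing how a naive upsample of 'relevance' breaks an initially balanced 'class' column:

```python
import pandas as pd
from sklearn.utils import resample

# Made-up miniature dataset: 'class' is perfectly balanced (100 rows each),
# but the few relevance=1 rows are unevenly spread across classes (4 A, 1 B, 1 C).
df = pd.DataFrame({'class': ['A'] * 100 + ['B'] * 100 + ['C'] * 100})
df['relevance'] = 0
df.loc[[0, 1, 2, 3, 100, 200], 'relevance'] = 1

df_minority = df[df['relevance'] == 1]
df_rest = df[df['relevance'] != 1]

# Naive upsample of the minority, ignoring 'class'
df_1_upsampled = resample(df_minority, random_state=0, n_samples=294, replace=True)
df_upsampled = pd.concat([df_1_upsampled, df_rest])

# Roughly 4/6 of the 294 new rows are class A, so 'class' is no longer 1:1:1
print(df_upsampled['class'].value_counts())
```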


Solution

  • Here is a way to do it per class. Note that I'm not sure whether this will bias a model trained on it afterwards; I don't have enough experience there. First, let's create dummy data that is closer to your real data.

    # dummy data
    import numpy as np
    import pandas as pd

    np.random.seed(0)
    df = pd.DataFrame({
        'relevance': np.random.choice(a=[0]*14810+[1]*190, size=15000, replace=False),
        'class': list('ABCDEFGHIKLMNOP')*1000,
        2: np.random.randint(0, 100, 15000), 3: np.random.randint(0, 100, 15000),
        4: np.random.randint(0, 100, 15000), 5: np.random.randint(0, 100, 15000),
    })
    

    A quick sanity check on the class-relevance crosstab; you will need this info anyway. Every class has 1000 samples, and each class has a different number of relevance=1 rows:

    ct = pd.crosstab(df['class'], df['relevance'])
    print(ct.head())
    # relevance    0   1
    # class             
    # A          983  17
    # B          982  18
    # C          990  10
    # D          993   7
    # E          993   7
    

    Now you can calculate the number of upsampled rows needed per class. Note that this target can be defined in several ways; in particular, the 1000 can be replaced by any other number.

    nb_upsample = (1000*ct[0].mean()/ct[0]).astype(int)
    print(nb_upsample.head())
    # class
    # A    1004
    # B    1005
    # C     997
    # D     994
    # E     994
    # Name: 0, dtype: int32
    

    Now you can upsample per class (df_minority here is the minority frame from the question's code):

    df_1_upsampled = (
        df_minority.groupby(['class'])
          .apply(lambda x: resample(x, random_state=1, replace=True,
                                    n_samples=nb_upsample[x.name]))
          .reset_index(drop=True)
    )
    print(df_1_upsampled['class'].value_counts().head())
    # B    1005
    # A    1004
    # L    1004
    # M    1003
    # H    1001
    # Name: class, dtype: int64
    

    Finally, concatenate with df_rest and check the class and relevance ratios:

    df_upsampled = pd.concat([df_1_upsampled,df_rest])
    print(df_upsampled['class'].value_counts().head()) #same ratio
    # A    1987
    # B    1987
    # C    1987
    # D    1987
    # E    1987
    # Name: class, dtype: int64
    print(df_upsampled['relevance'].value_counts()) # almost same relevance number
    # 1    14995 #this number is affected by the 1000 in nb_upsample
    # 0    14810
    # Name: relevance, dtype: int64
    

    You can see there are more relevance=1 rows now. To tune this, change the 1000 in the line defining nb_upsample to any number you want. You could also use nb_upsample = (ct[0].mean()**2/ct[0]).astype(int), which would balance the two relevance categories a bit more.
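For comparison, here is a self-contained sketch of that alternative target (reusing the dummy-data construction from above): the per-class targets now sum to roughly the 14810 relevance=0 rows, so the two relevance categories end up nearly equal.

```python
import numpy as np
import pandas as pd

# Same dummy data as in the answer above
np.random.seed(0)
df = pd.DataFrame({
    'relevance': np.random.choice(a=[0]*14810 + [1]*190, size=15000, replace=False),
    'class': list('ABCDEFGHIKLMNOP') * 1000,
})

ct = pd.crosstab(df['class'], df['relevance'])

# Target mean(ct[0])**2 / ct[0] instead of 1000 * mean(ct[0]) / ct[0]:
# the targets now sum to roughly ct[0].sum() = 14810 rather than 15 * 1000 = 15000.
nb_upsample = (ct[0].mean()**2 / ct[0]).astype(int)
print(nb_upsample.sum())
```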