Tags: python, scikit-learn, classification, sampling

For a binary classification, how can I sample data by column so that 2/3 of the data contains zeros and 1/3 contains ones?


I have a large dataset with four columns; the third column holds the binary label (either 0 or 1). The dataset is imbalanced: it contains many more zeros than ones. The data looks like:

3   5   0   0.4
4   5   1   0.1
5   13  0   0.5
6   10  0   0.8
7   25  1   0.3
:   :   :   :

I know that I can obtain a balanced subset containing 50% zeros and 50% ones with, for example:

df_sampled = df.groupby(df.iloc[:,2]).sample(n=20000, random_state=1)

But how can I amend the one-liner above to change the ratio of zeros to ones? For example, how can I sample this data (by the third column) so that 2/3 of the sampled rows are zeros and 1/3 are ones?


Solution

  • This is a possible solution:

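    import pandas as pd

    n_samples = 90000  # total number of rows in the sampled subset

    # Draw 2/3 of the rows from label 0 and 1/3 from label 1.
    # Each group must hold at least that many rows (or pass replace=True).
    # random_state makes the draw reproducible, as in the question.
    df_sampled = pd.concat(
        [group.sample(n=int(n_samples * 2 / 3), random_state=1) if label == 0
         else group.sample(n=int(n_samples * 1 / 3), random_state=1)
         for label, group in df.groupby(df.iloc[:, 2])]
    )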
    

    A more compact variant looks up each group's share in a list indexed by the label:

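    import pandas as pd

    n_samples = 90000  # total number of rows in the sampled subset
    ratios = [2 / 3, 1 / 3]  # ratios[label] is the share for that label (0 or 1)

    # round() rather than int(), so floating-point error cannot drop a row.
    df_sampled = pd.concat(
        [group.sample(n=round(n_samples * ratios[label]), random_state=1)
         for label, group in df.groupby(df.iloc[:, 2])]
    )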
    

    In both versions I group by the label, draw a different number of rows from each group, and concatenate the results.
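
The same idea can be wrapped in a small helper that also checks whether each group is large enough. This is only a sketch: the function name `sample_by_ratio`, the column names, and the synthetic data are made up for illustration and are not part of the question's dataset.

```python
import numpy as np
import pandas as pd


def sample_by_ratio(df, label_col, ratios, n_samples, random_state=None):
    """Sample about n_samples rows so that each label has the requested share.

    ratios maps each label value to its fraction of the sample
    (the fractions are expected to sum to 1).
    """
    parts = []
    for label, group in df.groupby(label_col):
        n = round(n_samples * ratios[label])  # rows to draw for this label
        if n > len(group):
            raise ValueError(
                f"label {label!r} has only {len(group)} rows, {n} requested"
            )
        parts.append(group.sample(n=n, random_state=random_state))
    return pd.concat(parts)


# Hypothetical imbalanced data: 900 zeros and 100 ones in the third column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.integers(0, 10, size=1000),
    "b": rng.integers(0, 30, size=1000),
    "label": np.array([0] * 900 + [1] * 100),
    "score": rng.random(1000),
})

sampled = sample_by_ratio(
    df, label_col="label", ratios={0: 2 / 3, 1: 1 / 3},
    n_samples=90, random_state=1,
)
counts = sampled["label"].value_counts()
print(counts)
```

Here the minority class caps the achievable sample size: with only 100 ones, a 1/3 share allows at most 300 rows in total unless you sample with replacement.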