Tags: python, pandas, machine-learning, random, sample

Python - Sampling imbalanced dataset


I have a dataset with 3 classes. The output of value_counts() is below.

Class 0 - 2000
Class 1 - 10000
Class 2 - 10000

I want to sample this dataset so that the result has the distribution below.

Class 0 - 2000 (i.e., all rows from Class 0)
Class 1 - 4000 (i.e., twice as many rows as Class 0)
Class 2 - 4000 (i.e., twice as many rows as Class 0)

Random sampling using weights retrieves only a fraction of Class 0, but I need every Class 0 row. Please advise.


Solution

  • If I understand you correctly, you can sample each class separately with a fixed row count per class:

    import numpy as np
    import pandas as pd
    
    # Create sample data with the class counts from the question
    df = pd.DataFrame({"class": np.repeat([0, 1, 2], [2_000, 10_000, 10_000])})
    
    # Target number of rows per class
    distribution = {0: 2000, 1: 4000, 2: 4000}
    
    # Sample each class separately (without replacement), then combine the pieces
    sample = pd.concat(
        [group.sample(n=distribution[class_]) for class_, group in df.groupby("class")]
    )
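
If the class counts are not known in advance, the targets can be derived from the data instead of being hard-coded. The following is a minimal sketch of that idea, not part of the original answer: the helper name `sample_by_ratio` and its `ratio` and `random_state` parameters are hypothetical, and it assumes `ratio >= 1` so that the smallest class is always kept whole.

```python
import numpy as np
import pandas as pd


def sample_by_ratio(df, label_col, ratio=2, random_state=None):
    """Keep every row of the smallest class and at most
    ratio * (smallest class size) rows of each other class.

    Hypothetical helper for illustration; assumes ratio >= 1.
    """
    minority_size = int(df[label_col].value_counts().min())
    parts = []
    for _, group in df.groupby(label_col):
        # Never request more rows than the class actually has.
        n = min(len(group), ratio * minority_size)
        parts.append(group.sample(n=n, random_state=random_state))
    return pd.concat(parts)


# Same class counts as in the question.
df = pd.DataFrame({"class": np.repeat([0, 1, 2], [2_000, 10_000, 10_000])})
sample = sample_by_ratio(df, "class", ratio=2, random_state=0)
print(sample["class"].value_counts().sort_index())
```

The result is still grouped by class. If row order matters downstream, for example when feeding batches to a model, it can be shuffled with `sample.sample(frac=1)`.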