python pandas numpy sampling

50/50 Sampling in Python


I have a data set whose binary target variable has a 4%/96% split. I want to create a subset of the data with a 50/50 split. What is the best way to do that in Python? Thanks!


Solution

  • You can group the data by your binary variable with groupby(), then sample the same number of rows from each group.

    Generate some random data:

    >>> import random
    >>> import pandas as pd
    >>> df = pd.DataFrame([{'variable': ''.join(random.sample('abcdefghijklmnopqrstuvwxyz', 4)), 'outcome': (random.random() > .94)} for i in range(100)])
    
    >>> print(df)
        outcome variable
    0     False     irlk
    1     False     ylmp
    2      True     przk
    3     False     xldf
    4     False     oxsp
    5     False     uytn
    6     False     ifmw
    7      True     lepa
    8     False     zfvm
    ...
    98    False     qjek
    99    False     umtw
    

    Sample as needed:

    >>> num_samples = 3
    >>> # Each group must have at least num_samples rows, or sample() raises ValueError.
    >>> df.groupby('outcome').apply(lambda x: x.sample(num_samples))
                outcome variable
    outcome                     
    False   71    False     jdrp
            98    False     eqrj
            78    False     tnzl
    True    29     True     uvjr
            36     True     tiwn
            63     True     tabr
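
To get an actual 50/50 subset rather than a fixed three rows per class, one option is to downsample every class to the size of the rarest one. The following is a minimal sketch of that idea, not a definitive implementation: `balanced_sample`, `n_per_class`, and `random_state` are hypothetical names introduced here, and the toy frame is made up to mimic the 4/96 split from the question.

```python
import pandas as pd


def balanced_sample(df, target, n_per_class=None, random_state=None):
    """Return a subset of df with an equal number of rows per value of `target`.

    Hypothetical helper for illustration; it downsamples without replacement.
    """
    counts = df[target].value_counts()
    if n_per_class is None:
        # Default to the rarest class size so no group is asked for more rows than it has.
        n_per_class = int(counts.min())
    elif n_per_class > counts.min():
        raise ValueError("n_per_class exceeds the size of the smallest class")

    # Draw the same number of rows from each class.
    parts = [
        group.sample(n=n_per_class, random_state=random_state)
        for _, group in df.groupby(target)
    ]
    # Combine the per-class samples and shuffle so the classes are interleaved.
    return pd.concat(parts).sample(frac=1, random_state=random_state)


# Toy data (assumed for illustration): 4 positives out of 100 rows.
df = pd.DataFrame({
    "variable": [f"row{i}" for i in range(100)],
    "outcome": [i < 4 for i in range(100)],
})

balanced = balanced_sample(df, "outcome", random_state=0)
print(balanced["outcome"].value_counts())
```

With a 4/96 split this keeps every minority row and discards most of the majority class, so the balanced subset is small. On pandas 1.1 or later, `df.groupby('outcome').sample(n=num_samples)` draws the per-group sample without `apply`.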