
Randomly split model scores into 4 groups with similar distributions in Python


I have a dataset with model scores ranging from 0 to 1. The table looks like this:

| Score |
| ----- |
| 0.55  |
| 0.67  |
| 0.21  |
| 0.05  |
| 0.91  |
| 0.15  |
| 0.33  |
| 0.47  |

I want to randomly divide these scores into 4 groups: control, treatment 1, treatment 2, and treatment 3. The control group should have 20% of the observations, and the remaining 80% should be divided equally among the 3 treatment groups. However, I want the distribution of scores in each group to be the same. How can I do this in Python?

PS: This is just a sample of the actual table, which has many more observations.


Solution

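  • You can use numpy.random.choice to assign each row to a random group with the given probabilities, then use groupby to split the DataFrame. The assignment is independent of the score, so each group's score distribution matches the overall one in expectation; group sizes come out close to, but not exactly at, the 20% / 26.7% / 26.7% / 26.7% split: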

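    import numpy as np
    import pandas as pd

    # df is the DataFrame with the 'Score' column; draw one group label per row
    group = np.random.choice(['control', 'treatment 1', 'treatment 2', 'treatment 3'],
                              size=len(df),
                              p=[.2, .8/3, .8/3, .8/3])

    # Split df into a dict mapping each group label to its sub-DataFrame
    dict(list(df.groupby(pd.Series(group, index=df.index))))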
    

    Possible output (each value in the dictionary is a DataFrame):

    {'control':    Score
     2   0.21
     5   0.15,
     'treatment 1':    Score
     7   0.47,
     'treatment 2':    Score
     1   0.67
     3   0.05,
     'treatment 3':    Score
     0   0.55
     4   0.91
     6   0.33}
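
  • If the group sizes and score distributions need to match more tightly than independent random draws give you, one option is a stratified assignment: bin the scores into quantiles and split every bin with the same 20% / 26.7% / 26.7% / 26.7% proportions. This is a different technique from the numpy.random.choice approach above, shown only as a minimal sketch. The function name stratified_assign, the n_bins and seed parameters, and the synthetic demo data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def stratified_assign(scores, n_bins=10, seed=0):
    """Hypothetical helper: split every score-quantile bin 20/26.7/26.7/26.7."""
    labels = ["control", "treatment 1", "treatment 2", "treatment 3"]
    weights = np.array([0.2, 0.8 / 3, 0.8 / 3, 0.8 / 3])
    rng = np.random.default_rng(seed)
    scores = pd.Series(scores)

    # Quantile bins over ranks; ranking first avoids duplicate bin edges on ties.
    bins = pd.qcut(scores.rank(method="first"), q=n_bins, labels=False).to_numpy()

    out = np.empty(len(scores), dtype=object)
    for b in np.unique(bins):
        pos = np.flatnonzero(bins == b)          # row positions in this bin
        # Cumulative cut points so the per-bin group sizes sum to the bin size.
        cuts = np.round(np.cumsum(weights) * len(pos)).astype(int)
        sizes = np.diff(np.concatenate([[0], cuts]))
        # Shuffle the labels so assignment within the bin is random.
        out[pos] = rng.permutation(np.repeat(labels, sizes))
    return pd.Series(out, index=scores.index, name="group")

# Synthetic demo data (assumption): 1000 uniform scores in [0, 1).
demo = pd.DataFrame({"Score": np.random.default_rng(42).random(1000)})
demo["group"] = stratified_assign(demo["Score"])

# Compare group sizes and score distributions side by side.
print(demo.groupby("group")["Score"].describe())
```

    Group sizes match the target proportions up to rounding within each bin, and every group draws from every part of the score range. More bins tighten the match between distributions, but each bin should still hold enough rows to be split four ways.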