I have a dataset with model scores ranging from 0 to 1. The table looks like this:
| Score |
| ----- |
| 0.55 |
| 0.67 |
| 0.21 |
| 0.05 |
| 0.91 |
| 0.15 |
| 0.33 |
| 0.47 |
I want to randomly divide these scores into 4 groups: `control`, `treatment 1`, `treatment 2`, and `treatment 3`. The `control` group should have 20% of the observations, and the remaining 80% should be split equally among the other 3 groups. However, I want the distribution of scores in each group to be the same. How can I solve this using Python?
PS: This is just a representation of the actual table, but it will have a lot more observations than this.
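You can use `numpy.random.choice` to assign random groups with defined probabilities, then `groupby` to split the DataFrame. Note that each row is drawn independently, so the group sizes match 20% / 26.7% / 26.7% / 26.7% only on average, not exactly: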
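import numpy as np
import pandas as pd

# Draw one label per row of df with the requested probabilities
group = np.random.choice(['control', 'treatment 1', 'treatment 2', 'treatment 3'],
                         size=len(df),
                         p=[.2, .8/3, .8/3, .8/3])
# Split df into a dict of DataFrames keyed by group label
dict(list(df.groupby(pd.Series(group, index=df.index))))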
Possible output (each value in the dictionary is a DataFrame):
{'control': Score
2 0.21
5 0.15,
'treatment 1': Score
7 0.47,
'treatment 2': Score
1 0.67
3 0.05,
'treatment 3': Score
0 0.55
4 0.91
6 0.33}
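If you need exact group sizes and a closer match between the score distributions, one option is blocked randomization: sort by score, walk through the rows in blocks of 15, and shuffle a fixed 3:4:4:4 label pattern inside each block. This is a different technique from the `np.random.choice` approach above, shown only as a minimal sketch. The helper name `assign_groups`, the `seed` parameter, and the block size of 15 are assumptions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

def assign_groups(df, score_col="Score", seed=0):
    """Hypothetical helper: blocked randomization over score-sorted rows."""
    rng = np.random.default_rng(seed)
    labels = np.array(["control", "treatment 1", "treatment 2", "treatment 3"])
    # 15 labels in a 3:4:4:4 ratio, i.e. 20% control and 26.7% per treatment
    block = np.repeat(labels, [3, 4, 4, 4])
    # Positions of the rows ordered from lowest to highest score
    order = df[score_col].to_numpy().argsort(kind="stable")
    out = np.empty(len(df), dtype=object)
    for start in range(0, len(df), len(block)):
        idx = order[start:start + len(block)]
        # Shuffle the pattern; a final partial block takes a random subset of it
        out[idx] = rng.permutation(block)[:len(idx)]
    return pd.Series(out, index=df.index, name="group")

# Example with synthetic scores (assumed data, not the asker's table)
df = pd.DataFrame({"Score": np.random.default_rng(1).random(150)})
df["group"] = assign_groups(df)
print(df["group"].value_counts())
print(df.groupby("group")["Score"].describe())
```

When the number of rows is a multiple of 15, the split is exactly 20% / 26.7% / 26.7% / 26.7%; otherwise the last partial block makes it approximate. Because every block of 15 neighbouring scores contributes to all four groups, the per-group summaries printed by `describe()` should come out close to each other.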