I have a data set with a binary target variable that has a 4/96 percent split. I want to create a subset of the data with a 50/50 split. I would like to know the best way to do that in Python. Thanks!
You can groupby() on your binary variable, then sample from each group.
Generate some random data:
>>> import random
>>> import pandas as pd
>>> df = pd.DataFrame([{'variable': ''.join(random.sample('abcdefghijklmnopqrstuvwxyz', 4)), 'outcome': (random.random() > .94)} for i in range(100)])
>>> print(df)
     outcome  variable
0      False      irlk
1      False      ylmp
2       True      przk
3      False      xldf
4      False      oxsp
5      False      uytn
6      False      ifmw
7       True      lepa
8      False      zfvm
...
99     False      qjek
100    False      umtw
Sample as needed:
>>> num_samples = 3
>>> df.groupby('outcome').apply(lambda x: x.sample(num_samples))
            outcome  variable
outcome
False   71    False      jdrp
        98    False      eqrj
        78    False      tnzl
True    29     True      uvjr
        36     True      tiwn
        63     True      tabr
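To get an actual 50/50 split rather than a fixed number of rows, one option is to set the per-group sample size to the size of the smaller class. The following is a minimal sketch, not the only way to do it: the helper name balance_binary, the seed argument, and the synthetic 4/96 data are assumptions made for illustration, and it relies on GroupBy.sample, which is available in pandas 1.1 and later.

```python
import numpy as np
import pandas as pd


def balance_binary(df, target, seed=0):
    """Downsample each class to the size of the smallest class (hypothetical helper)."""
    # Size of the rarest class; every class is cut down to this many rows.
    n_minority = df[target].value_counts().min()
    # Draw n_minority rows from each group, without replacement.
    balanced = df.groupby(target).sample(n=n_minority, random_state=seed)
    # Shuffle so the two classes do not sit in contiguous blocks.
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)


# Synthetic data with an exact 4/96 split (40 True rows out of 1000).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "variable": rng.integers(0, 1000, size=1000),
    "outcome": np.arange(1000) < 40,
})

balanced = balance_binary(df, "outcome")
print(balanced["outcome"].value_counts())
```

This throws away most of the majority class, so the balanced subset can be no larger than twice the minority count. If that is too little data, passing replace=True to sample() to oversample the minority class is an alternative, at the cost of duplicated rows.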