I have created the following pandas dataframe.
import pandas as pd
import numpy as np
ds = {'col1' : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
'col2' : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
data = pd.DataFrame(data=ds)
which looks like this:
print(data)
col1 col2
0 1 0
1 1 0
2 1 0
3 1 0
4 1 0
5 1 0
6 1 0
7 1 0
8 1 0
9 1 0
10 1 0
11 1 0
12 1 0
13 1 0
14 2 1
15 2 1
16 2 1
17 2 1
18 2 1
19 2 1
20 2 1
21 2 1
22 2 1
23 2 1
24 2 1
25 2 1
26 2 1
27 2 1
I need to create a new column (called col3
) subject to the following conditions:
when col1 = 1
, there are 14 records for which col2 = 0. The new column (i.e. col3
), needs to have 50% (of exactly those 14 records) of the values equal to col2
(randomly distributed across the 14 records) and the remaining 50% equal to 1.
when col1 = 2
, there are 14 records for which col2 = 1
. The new column (i.e. col3
), needs to have 50% (of exactly those 14 records) of the values equal to col2
(randomly distributed across the 14 records) and the remaining 50% equal to 0.
So, the resulting dataset would look like this (bear in mind that the location - or record - of the values in col3
is randomly assigned):
Does anyone know the python code to produce such dataframe?
groupby
+ sample
# take a sample of 50% from col2 per unique value in col1
data['col3'] = data.groupby('col1')['col2'].sample(frac=.5)
# fill the remaining 50% using a predefined mapping of col1 value
data['col3'] = data['col3'].fillna(data['col1'].map({1: 1, 2: 0}), downcast='infer')
Result
col1 col2 col3
0 1 0 1
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0
5 1 0 0
6 1 0 0
7 1 0 1
8 1 0 0
9 1 0 1
10 1 0 1
11 1 0 1
12 1 0 1
13 1 0 1
14 2 1 1
15 2 1 0
16 2 1 1
17 2 1 0
18 2 1 0
19 2 1 1
20 2 1 1
21 2 1 0
22 2 1 1
23 2 1 0
24 2 1 0
25 2 1 1
26 2 1 1
27 2 1 0