Search code examples
pandasdataframerandomconditional-statements

Create pandas dataframe column with random conditional numbers


I have created the following pandas dataframe.

import pandas as pd
import numpy as np

ds = {'col1' : [1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2],
      'col2' : [0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1]}

data = pd.DataFrame(data=ds)

which looks like this:

print(data)

    col1  col2
0      1     0
1      1     0
2      1     0
3      1     0
4      1     0
5      1     0
6      1     0
7      1     0
8      1     0
9      1     0
10     1     0
11     1     0
12     1     0
13     1     0
14     2     1
15     2     1
16     2     1
17     2     1
18     2     1
19     2     1
20     2     1
21     2     1
22     2     1
23     2     1
24     2     1
25     2     1
26     2     1
27     2     1

I need to create a new column (called col3) subject to the following conditions:

  1. when col1 = 1, there are 14 records for which col2 = 0. The new column (i.e. col3), needs to have 50% (of exactly those 14 records) of the values equal to col2 (randomly distributed across the 14 records) and the remaining 50% equal to 1.

  2. when col1 = 2, there are 14 records for which col2 = 1. The new column (i.e. col3), needs to have 50% (of exactly those 14 records) of the values equal to col2 (randomly distributed across the 14 records) and the remaining 50% equal to 0.

So, the resulting dataset would look like this (bear in mind that the location - or record - of the values in col3 is randomly assigned):

enter image description here

Does anyone know the python code to produce such dataframe?


Solution

  • groupby + sample

    # take a sample of 50% from col2 per unique value in col1
    data['col3'] = data.groupby('col1')['col2'].sample(frac=.5)
    
    # fill the remaining 50% using a predefined mapping of col1 value
    data['col3'] = data['col3'].fillna(data['col1'].map({1: 1, 2: 0}), downcast='infer')
    

    Result

        col1  col2  col3
    0      1     0     1
    1      1     0     0
    2      1     0     0
    3      1     0     0
    4      1     0     0
    5      1     0     0
    6      1     0     0
    7      1     0     1
    8      1     0     0
    9      1     0     1
    10     1     0     1
    11     1     0     1
    12     1     0     1
    13     1     0     1
    14     2     1     1
    15     2     1     0
    16     2     1     1
    17     2     1     0
    18     2     1     0
    19     2     1     1
    20     2     1     1
    21     2     1     0
    22     2     1     1
    23     2     1     0
    24     2     1     0
    25     2     1     1
    26     2     1     1
    27     2     1     0