
python pandas: assign control vs. treatment groupings randomly based on %


I am working on an experiment design where I need to split a dataframe df into control and treatment groups. The split percentage depends on a pre-existing grouping in the data.

This is the dataframe df:

df.head()

customer_id | Group | many other columns
ABC             1
CDE             1
BHF             2
NID             1
WKL             2
SDI             2

pd.pivot_table(df,index=['Group'],values=["customer_id"],aggfunc=lambda x: len(x.unique()))

Group 1  : 55394
Group 2  : 34889

Now I need to add a column labeled "Flag" to df. For Group 1, I want to randomly assign 50% "Control" and 50% "Test". For Group 2, I want to randomly assign 40% "Control" and 60% "Test".

The output I am looking for:

customer_id | Group | many other columns | Flag
ABC             1                          Test
CDE             1                          Control
BHF             2                          Test
NID             1                          Test
WKL             2                          Control
SDI             2                          Test

Solution

  • We can use the numpy.random.choice() method inside a groupby, choosing the probabilities per group:

    In [160]: df['Flag'] = \
         ...: df.groupby('Group')['customer_id']\
         ...:   .transform(lambda x: np.random.choice(['Control','Test'], len(x),
         ...:                                         p=[.5,.5] if x.name==1 else [.4,.6]))
         ...:
    
    In [161]: df
    Out[161]:
      customer_id  Group     Flag
    0         ABC      1  Control
    1         CDE      1     Test
    2         BHF      2     Test
    3         NID      1  Control
    4         WKL      2     Test
    5         SDI      2  Control
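For reference, the session above can be sketched as a self-contained script. The imports, the toy dataframe, the helper name `assign_flags`, and the seed value are assumptions added for illustration; the seeded generator is only there so the assignment can be reproduced. Note that each row is drawn independently, so the realised Control/Test shares per group will be close to the requested probabilities on large groups but are not guaranteed to match them exactly.

```python
import numpy as np
import pandas as pd

# Toy data mirroring the layout in the question (illustrative only)
df = pd.DataFrame({
    "customer_id": ["ABC", "CDE", "BHF", "NID", "WKL", "SDI"],
    "Group": [1, 1, 2, 1, 2, 2],
})

# Seeded generator so the assignment is reproducible; the seed is arbitrary
rng = np.random.default_rng(42)

def assign_flags(s):
    """Draw one label per row; s.name holds the group key inside transform."""
    p = [0.5, 0.5] if s.name == 1 else [0.4, 0.6]
    return rng.choice(["Control", "Test"], size=len(s), p=p)

df["Flag"] = df.groupby("Group")["customer_id"].transform(assign_flags)
print(df)
```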
    

    UPDATE: for more than two groups, look the probabilities up in a dict keyed by group:

    In [8]: df
    Out[8]:
      customer_id  Group
    0         ABC      1
    1         CDE      1
    2         BHF      2
    3         NID      1
    4         WKL      2
    5         SDI      2
    6         XXX      3
    7         XYZ      3
    8         XXX      3
    
    In [9]: d = {1:[.5,.5], 2:[.4,.6], 3:[.2,.8]}
    
    In [10]: df['Flag'] = \
        ...: df.groupby('Group')['customer_id'] \
        ...:   .transform(lambda x: np.random.choice(['Control','Test'], len(x), p=d[x.name]))
        ...:
    
    In [11]: df
    Out[11]:
      customer_id  Group     Flag
    0         ABC      1     Test
    1         CDE      1     Test
    2         BHF      2  Control
    3         NID      1  Control
    4         WKL      2  Control
    5         SDI      2     Test
    6         XXX      3     Test
    7         XYZ      3     Test
    8         XXX      3     Test
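Because np.random.choice draws each row independently, the shares it produces are only approximately 50/50 or 40/60. If the experiment needs the split to hit the target counts (up to rounding), one possible alternative is to sample a fixed number of rows per group without replacement. This is a minimal sketch, not part of the original answer: the function name `assign_exact`, its parameters, and the demo dataframe are hypothetical, and it assumes the dataframe index is unique.

```python
import numpy as np
import pandas as pd

def assign_exact(df, control_share, group_col="Group", seed=0):
    """Return a Series of 'Control'/'Test' labels with a fixed Control count per group.

    control_share maps each group key to the desired Control fraction.
    Assumes df.index is unique.
    """
    rng = np.random.default_rng(seed)
    flags = pd.Series("Test", index=df.index, dtype=object)
    for key, idx in df.groupby(group_col).groups.items():
        # Number of Control rows for this group, rounded to the nearest integer
        n_control = int(round(control_share[key] * len(idx)))
        # Pick that many index labels at random, without replacement
        chosen = rng.choice(idx.to_numpy(), size=n_control, replace=False)
        flags.loc[chosen] = "Control"
    return flags

# Demo data: 10 rows in each of two groups (illustrative only)
demo = pd.DataFrame({
    "customer_id": [f"C{i:02d}" for i in range(20)],
    "Group": [1] * 10 + [2] * 10,
})
demo["Flag"] = assign_exact(demo, {1: 0.5, 2: 0.4})
counts = demo.groupby("Group")["Flag"].value_counts()
print(counts)
```

With the demo data, Group 1 gets 5 Control rows out of 10 and Group 2 gets 4 out of 10; which rows are selected depends on the seed.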