Tags: python, pandas, dataframe, aggregate, dummy-variable

Pandas, merge multiple dummy variables into one column by name


I have a data file with one value column (salary) and several dummy variables that encode a type. A short example is copied below. I need the average salary for each type, which I can compute, alongside a single column that names the type, which I have not managed to produce. Pointers would be welcome.

import pandas as pd
data = {'salary' : [50000, 45000, 55000, 40000, 35000, 45000, 30000,25000,35000],
        'manager': [1,1,1,0,0,0,0,0,0], 'foreman': [0,0,0,1,1,1,0,0,0], 
        'worker': [0,0,0,0,0,0,1,1,1]}
df = pd.DataFrame(data=data)
df

This is my input data.

   salary  manager  foreman  worker
0   50000        1        0       0
1   45000        1        0       0
2   55000        1        0       0
3   40000        0        1       0
4   35000        0        1       0
5   45000        0        1       0
6   30000        0        0       1
7   25000        0        0       1
8   35000        0        0       1

I can get the averages like this, but I cannot consolidate the three dummy variables into one categorical column:

print(df.groupby(['manager','foreman','worker']).mean().reset_index())

   manager  foreman  worker  salary
0        0        0       1   30000
1        0        1       0   40000
2        1        0       0   50000

I would like to have something that looks like this:

need = {'salary' : [50000, 45000, 55000, 40000, 35000, 45000, 30000,25000,35000],
        'type': ['manager','manager','manager','foreman','foreman','foreman','worker','worker','worker']}
df2 = pd.DataFrame(data=need)
df2

   salary     type
0   50000  manager
1   45000  manager
2   55000  manager
3   40000  foreman
4   35000  foreman
5   45000  foreman
6   30000   worker
7   25000   worker
8   35000   worker

For this simple example I can build the final result by hand. This is the table I ultimately want to end up with:

pay = {'type' : ['manager','foreman','worker'], 'avg_pay': [50000,40000,30000]}
df1 = pd.DataFrame(data=pay)
df1

      type  avg_pay
0  manager    50000
1  foreman    40000
2   worker    30000

I can't find any documentation on how to "undummy" variables. How do I do this?


Solution

  • Solutions if there is always exactly one 1 per row:

    Use DataFrame.melt with ignore_index=False (available from pandas 1.1) and var_name='type', then keep only the rows where value equals 1 with DataFrame.loc. DataFrame.pop removes the value column while the filter is being built:

    df = (df.melt('salary', ignore_index=False, var_name='type')
            .loc[lambda x: x.pop('value').eq(1)])
    print (df)
       salary     type
    0   50000  manager
    1   45000  manager
    2   55000  manager
    3   40000  foreman
    4   35000  foreman
    5   45000  foreman
    6   30000   worker
    7   25000   worker
    8   35000   worker
    

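    Or, starting again from the original df, stack the dummy columns, keep the entries equal to 1, and join their column names back onto salary: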

    s = df.drop('salary', axis=1).stack()
    df = df[['salary']].join(s[s.eq(1)].index.to_frame().droplevel(1)[1].rename('type'))
    print (df)
       salary     type
    0   50000  manager
    1   45000  manager
    2   55000  manager
    3   40000  foreman
    4   35000  foreman
    5   45000  foreman
    6   30000   worker
    7   25000   worker
    8   35000   worker
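
To finish the task from the question (average pay per type), one possible sketch follows. It uses DataFrame.idxmax(axis=1) instead of the melt or stack approaches, which is a swapped-in technique rather than the method shown above. The names dummy_cols, tidy, and avg_pay are illustrative, and the code assumes every row has exactly one 1 among the dummy columns.

```python
import pandas as pd

data = {'salary': [50000, 45000, 55000, 40000, 35000, 45000, 30000, 25000, 35000],
        'manager': [1, 1, 1, 0, 0, 0, 0, 0, 0],
        'foreman': [0, 0, 0, 1, 1, 1, 0, 0, 0],
        'worker': [0, 0, 0, 0, 0, 0, 1, 1, 1]}
df = pd.DataFrame(data)

# Illustrative name: the columns that act as dummy variables.
dummy_cols = ['manager', 'foreman', 'worker']

# Assumption: exactly one 1 per row. idxmax would silently pick the
# first column for an all-zero row, so check before relying on it.
if not df[dummy_cols].sum(axis=1).eq(1).all():
    raise ValueError('expected exactly one 1 per row in the dummy columns')

# idxmax(axis=1) returns, for each row, the name of the column holding the 1.
tidy = df[['salary']].assign(type=df[dummy_cols].idxmax(axis=1))

# sort=False keeps the types in order of first appearance.
avg_pay = (tidy.groupby('type', sort=False)['salary']
               .mean()
               .reset_index(name='avg_pay'))
print(avg_pay)
```

For the sample data this yields one row per type, with averages of 50000 for manager, 40000 for foreman, and 30000 for worker (as floats, since mean returns float64).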
        
    

    Solution if a row can contain multiple 1s: use DataFrame.dot with the column names, so that all matching names are joined with a comma:

    df1 = df.set_index('salary')
    df = df1.eq(1).dot(df1.columns + ',').str[:-1].reset_index(name='type')
    print (df)
       salary     type
    0   50000  manager
    1   45000  manager
    2   55000  manager
    3   40000  foreman
    4   35000  foreman
    5   45000  foreman
    6   30000   worker
    7   25000   worker
    8   35000   worker
    

    Test data with multiple 1s per row:

    data = {'salary' : [50000, 45000, 55000, 40000, 35000, 45000, 30000,25000,35000],
            'manager': [1,1,1,0,1,0,0,0,0], 'foreman': [0,0,0,1,1,1,0,0,0], 
            'worker': [1,0,0,1,0,0,1,1,1]}
    df = pd.DataFrame(data=data)
    print (df)
       salary  manager  foreman  worker
    0   50000        1        0       1
    1   45000        1        0       0
    2   55000        1        0       0
    3   40000        0        1       1
    4   35000        1        1       0
    5   45000        0        1       0
    6   30000        0        0       1
    7   25000        0        0       1
    8   35000        0        0       1
    
    df1 = df.set_index('salary')
    df = df1.eq(1).dot(df1.columns + ',').str[:-1].reset_index(name='type')
    print (df)
       salary             type
    0   50000   manager,worker
    1   45000          manager
    2   55000          manager
    3   40000   foreman,worker
    4   35000  manager,foreman
    5   45000          foreman
    6   30000           worker
    7   25000           worker
    8   35000           worker
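
A comma-joined type column cannot be grouped by individual type directly. If the goal is still an average per type, one option is to split the joined string and use DataFrame.explode so that each row counts toward every type it belongs to. This is a sketch under that assumption, which the question does not state; the names labels, exploded, and avg_pay are illustrative, and a row with no 1 at all would end up with an empty-string type.

```python
import pandas as pd

data = {'salary': [50000, 45000, 55000, 40000, 35000, 45000, 30000, 25000, 35000],
        'manager': [1, 1, 1, 0, 1, 0, 0, 0, 0],
        'foreman': [0, 0, 0, 1, 1, 1, 0, 0, 0],
        'worker': [1, 0, 0, 1, 0, 0, 1, 1, 1]}
df = pd.DataFrame(data)

# Same dot-product step as above: join the names of all columns equal to 1.
df1 = df.set_index('salary')
labels = df1.eq(1).dot(df1.columns + ',').str[:-1].reset_index(name='type')

# Split 'manager,worker' into ['manager', 'worker'] and give each name its own row.
exploded = labels.assign(type=labels['type'].str.split(',')).explode('type')

# Assumption: a salary contributes to the average of every type on its row.
avg_pay = exploded.groupby('type', sort=False)['salary'].mean()
print(avg_pay)
```

With the test data, manager averages 46250, foreman 40000, and worker 36000, because rows 0, 3, and 4 each contribute to two types.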