I have a data frame with employees and all the roles that they are able to do.
`
Employees ID Brand_Manager Payroll_Manager Accountant Auditor
0 Jessi 1A 1 0 1 0
1 Lara 1B 1 0 0 1
2 Mike 1C 1 0 0 0
3 Artur 1D 1 0 0 0
4 James 2A 1 0 0 0
5 Claudia 3B 1 0 0 0
6 Zuzska 4C 1 1 0 1
7 Bartz 2B 1 1 0 0
8 Alexa 3B 1 1 0 0
`
To make work the program that I want to apply, I need to split the data and create new rows for the same person for each role (value=1). The rest of the values for the rest of the roles will become 0. The codes work well when using append(), however this example contain only 8 employees and 4 roles. I need to do the same with a lot more employees and almost 100 extra roles, which will create a very long script.
I have done this: First, select all the employees that have more than one role
col_list=df.columns
df['many'] = df[col_list].sum(axis=1)
df_single = df[ df['many'] == 1 ]
df_many = df[ df['many'] >= 2 ]
Then create lists and append:
lststaff = list ( df_many.Employees)
lstEmployees = []
lstID = []
lstBrand_Manager = []
lstPayroll_Manager = []
lstAccountant = []
lstAuditor = []
Loop through the names
for i in lststaff:
ID = str ( df_many.loc [ df_many['Employees'] == i, 'ID' ].tolist()[0] )
Brand_Manager = ( df_many.loc [ df_many['Employees'] == i, 'Brand_Manager'].astype(int) )
Brand_Manager = np.array(Brand_Manager)
if ( Brand_Manager == 1 ).any():
lstEmployees.append ( i + '_Brand_Manager' )
lstAccountant.append (0)
lstBrand_Manager.append(1)
lstAuditor.append(0)
lstID.append (ID)
lstPayroll_Manager.append(0)
Accountant = ( df_many.loc [ df_many['Employees'] == i, 'Accountant'].astype(int) )
Accountant = np.array(Accountant)
if ( Accountant == 1 ).any():
lstEmployees.append ( i + '_Accountant' )
lstAccountant.append (1)
lstBrand_Manager.append(0)
lstAuditor.append(0)
lstID.append (ID)
lstPayroll_Manager.append(0)
Auditor = ( df_many.loc [ df_many['Employees'] == i, 'Auditor'].astype(int) )
Auditor = np.array(Auditor)
if ( Auditor == 1 ).any():
lstEmployees.append ( i + '_Auditor' )
lstAccountant.append (0)
lstBrand_Manager.append(0)
lstAuditor.append(1)
lstID.append (ID)
lstPayroll_Manager.append(0)
Payroll_Manager = ( df_many.loc [ df_many['Employees'] == i, 'Payroll_Manager'].astype(int) )
Payroll_Manager = np.array(Payroll_Manager)
if ( Payroll_Manager == 1 ).any():
lstEmployees.append ( i + '_Payroll_Manager' )
lstAccountant.append (0)
lstBrand_Manager.append(0)
lstAuditor.append(0)
lstID.append (ID)
lstPayroll_Manager.append(1)
final_df = pd.DataFrame ( {
"Employees" : lstEmployees
,"ID" : lstID
,"Brand_Manager" :lstBrand_Manager
,"Accountant" :lstAccountant
,"Auditor" : lstAuditor
,"Payroll_Manager" : lstPayroll_Manager
}
)
final_df
The codes works well, however if I have to add 100 more roles. I will add 100 more lists and the amount of lists to be appended would be crazy.... Is there any other way of doing it with a function or for loops?
The output is as follow:
Employees ID Brand_Manager Accountant Auditor Payroll_Manager
0 Jessi_Brand_Manager 1A 1 0 0 0
1 Jessi_Accountant 1A 0 1 0 0
2 Lara_Brand_Manager 1B 1 0 0 0
3 Lara_Auditor 1B 0 0 1 0
4 Zuzska_Brand_Manager 4C 1 0 0 0
5 Zuzska_Auditor 4C 0 0 1 0
6 Zuzska_Payroll_Manager 4C 0 0 0 1
7 Bartz_Brand_Manager 2B 1 0 0 0
8 Bartz_Payroll_Manager 2B 0 0 0 1
9 Alexa_Brand_Manager 3B 1 0 0 0
10 Alexa_Payroll_Manager 3B 0 0 0 1
Here's one approach. pandas
gives a "SettingWithCopy" warning, but I believe this script will always result in the expected behavior.
import numpy as np
import pandas as pd
#input dataframe
df = pd.DataFrame({'Employees': {0: 'Jessi', 1: 'Lara', 2: 'Mike', 3: 'Artur', 4: 'James', 5: 'Claudia', 6: 'Zuzska', 7: 'Bartz', 8: 'Alexa'}, 'ID': {0: '1A', 1: '1B', 2: '1C', 3: '1D', 4: '2A', 5: '3B', 6: '4C', 7: '2B', 8: '3B'}, 'Brand_Manager': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1}, 'Payroll_Manager': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 1, 7: 1, 8: 1}, 'Accountant': {0: 1, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0}, 'Auditor': {0: 0, 1: 1, 2: 0, 3: 0, 4: 0, 5: 0, 6: 1, 7: 0, 8: 0}})
# isolate job columns
job_cols = df.columns.drop(['Employees','ID'])
num_jobs = len(job_cols)
# drop employees with only one job
df_in = df.loc[df[job_cols].sum(axis = 1)>=2,:]
# information processing
def get_nums(row):
return np.eye(num_jobs, dtype = int)[row.astype(bool)]
def get_labels(row):
return job_cols[row.astype(bool)]
df_proc = df_in[['Employees','ID']]
df_proc.loc[:,'all'] = df_in[job_cols].apply(get_nums,axis = 1)
df_proc = df_proc.explode('all').reset_index(drop=True)
df_proc[job_cols] = np.vstack(df_proc['all'])
df_proc['jobs'] = df_in[job_cols].apply(get_labels,axis = 1).explode().reset_index(drop=True)
# generate output dataframe
df_out = df_proc[['Employees','ID',*job_cols]]
df_out.loc[:,'Employees'] = df_proc['Employees'] + '_' + df_proc['jobs']
Resulting dataframe df_out
:
Employees ID Brand_Manager Payroll_Manager Accountant \
0 Jessi_Brand_Manager 1A 1 0 0
1 Jessi_Accountant 1A 0 0 1
2 Lara_Brand_Manager 1B 1 0 0
3 Lara_Auditor 1B 0 0 0
4 Zuzska_Brand_Manager 4C 1 0 0
5 Zuzska_Payroll_Manager 4C 0 1 0
6 Zuzska_Auditor 4C 0 0 0
7 Bartz_Brand_Manager 2B 1 0 0
8 Bartz_Payroll_Manager 2B 0 1 0
9 Alexa_Brand_Manager 3B 1 0 0
10 Alexa_Payroll_Manager 3B 0 1 0
Auditor
0 0
1 0
2 0
3 1
4 0
5 0
6 1
7 0
8 0
9 0
10 0
Intermediate dataframe df_proc before generating df_out, for reference:
Employees ID all Brand_Manager Payroll_Manager Accountant \
0 Jessi 1A [1, 0, 0, 0] 1 0 0
1 Jessi 1A [0, 0, 1, 0] 0 0 1
2 Lara 1B [1, 0, 0, 0] 1 0 0
3 Lara 1B [0, 0, 0, 1] 0 0 0
4 Zuzska 4C [1, 0, 0, 0] 1 0 0
5 Zuzska 4C [0, 1, 0, 0] 0 1 0
6 Zuzska 4C [0, 0, 0, 1] 0 0 0
7 Bartz 2B [1, 0, 0, 0] 1 0 0
8 Bartz 2B [0, 1, 0, 0] 0 1 0
9 Alexa 3B [1, 0, 0, 0] 1 0 0
10 Alexa 3B [0, 1, 0, 0] 0 1 0
Auditor jobs
0 0 Brand_Manager
1 0 Accountant
2 0 Brand_Manager
3 1 Auditor
4 0 Brand_Manager
5 0 Payroll_Manager
6 1 Auditor
7 0 Brand_Manager
8 0 Payroll_Manager
9 0 Brand_Manager
10 0 Payroll_Manager