Imagine I have the following dirty data of employee information of their contracts across countries (df1
):
ID Country Name Job Date Grade
1 CZ John Office 2021-01-01 Senior
1 SK John . 2021-01-01 Assistant
2 AE Peter Carpinter 2000-05-03
3 PE Marcia Cleaner 1989-11-11 ERROR!
3 FR Marcia Assistant 1978-01-05 High
3 FR Marcia 1999-01-01 Senior
I need to look into a LOV mapping table and on it, each country have different (or same) LOV columns that would replace the value provided by a code. For each country then, it would check if the column is in the LOV mapping for that country and, if the value exist in the "Values" column, replace to the corresponding code. If not, just leave the same value.
So using this mapping (df2
):
Country Field Values Code
US Job Back BA
US Job Front FR
US Job Office OFF
CZ Job Office CZ_OFF
CZ Job Field CZ_Fil
SK Job All ALL
FR Job Assistant AST
AE Job Carpinter CAR
AE Job Carpinter CAR
CZ Grade Senior S
CZ Grade Junior J
SK Grade M1 M1
FR Grade Low L
FR Grade Mid M1
FR Grade High H
Would result in the following dataframe:
ID Country Name Job Date Grade
1 CZ John CZ_OFF 2021-01-01 S
1 SK John . 2021-01-01 M1
2 AE Peter CAR 2000-05-03
3 PE Marcia Cleaner 1989-11-11 ERROR!
3 FR Marcia AST 1978-01-05 H
3 FR Marcia 1999-01-01 Senior
Thank you so much for the support!
Using a melt
+merge
+pivot
+combine_first
:
cols = df1.columns.difference(df2['Field'].unique())
out = (
df1.melt(cols, var_name='Field', value_name='Values', ignore_index=False)
.reset_index()
.merge(df2.drop_duplicates())
.pivot(index=cols.union(['index']), columns='Field', values='Code')
.reset_index(list(cols))
.combine_first(df1)[df1.columns]
)
Output:
ID Country Name Job Date Grade
0 1 CZ John CZ_OFF 2021-01-01 S
1 1 SK John . 2021-01-01 Assistant
2 2 AE Peter CAR 2000-05-03 NaN
3 3 PE Marcia Cleaner 1989-11-11 ERROR!
4 3 FR Marcia AST 1978-01-05 H
5 3 FR Marcia NaN 1999-01-01 Senior
Alternative using a loop (in place modification of df1
), and groupby
to facilitate the selection of the rows to merge
:
g = df2.groupby('Field')
for c in df1.columns.intersection(df2['Field'].unique()):
df1[c] = df1.merge(g.get_group(c).drop(columns='Field')
.rename(columns={'Values': c}), how='left'
)['Code'].fillna(df1[c])