Imagine I have a dirty dataframe of employees with their ID, and Contract related information per country.
Some columns of this dataframe are LOV columns (depending on the country, some columns are LOV for just one country, others some or all of them) and some LOV columns are mandatory and some are not (that is just used to understand if a blank value is accepted or not).
We would need to check, using another mapping dataframe:
If the value provided is not on the list, create a new column on the main dataframe named "Errors" where it says the name of the column it errored (if more than 1 column, maybe save the name in a list on that column).
So from this dataframe:
ID Country Contract Type
1 CZ Permanent BOFF
1 ES Fixed-term .
2 CZ Contractor Front-Office
3 PT Permanent
4 PT 2022-01-01 Employee
4 PT Fixed-term Office
4 ES Employee
5 SK Permanent Employee
And using this mapping:
Country Field Values Code Mandadory
CZ Contract Permanent PE Yes
CZ Contract Fixed-term FX Yes
CZ Contract Contractor CT Yes
ES Contract Permanent PERMA No
SK Contract Permanent PER-01 Yes
SK Contract Fixed-term FIX-01 Yes
ES Type Office OFF Yes
CZ Type Back-Office BOFF Yes
CZ Type Front-Office FOFF Yes
PT Type Employee EMP No
PT Type Front-Office FRONT No
Would result in this dataframe:
ID Country Contract Type Errors
1 CZ PE BOFF ['Type']
1 ES Fixed-term . ['Contract','Type']
2 CZ CT FOFF
3 PT Permanent
4 PT 2022-01-01 FRONT ['Type']
4 PT Fixed-term Office ['Type']
4 ES Employee ['Contract','Type']
5 SK PER-01 Employee
Thank you so much for the support!
With the dataframes you provided:
import pandas as pd
df = pd.DataFrame(
{
"ID": [1, 1, 2, 3, 4, 4, 4, 5],
"Country": ["CZ", "ES", "CZ", "PT", "PT", "PT", "ES", "SK"],
"Contract": [
"Permanent",
"Fixed-term",
"Contractor",
"Permanent",
"2022-01-01",
"Fixed-term",
"",
"Permanent",
],
"Type": [
"BOFF",
".",
"Front-Office",
"",
"Employee",
"Office",
"Employee",
"Employee",
],
}
)
other = pd.DataFrame(
{
"Country": ["CZ", "CZ", "CZ", "ES", "SK", "SK", "ES", "CZ", "CZ", "PT", "PT"],
"Field": [
"Contract",
"Contract",
"Contract",
"Contract",
"Contract",
"Contract",
"Type",
"Type",
"Type",
"Type",
"Type",
],
"Values": [
"Permanent",
"Fixed-term",
"Contractor",
"Permanent",
"Permanent",
"Fixed-term",
"Office",
"Back-Office",
"Front-Office",
"Employee",
"Front-Office",
],
"Code": [
"PE",
"FX",
"CT",
"PERMA",
"PER-01",
"FIX-01",
"OFF",
"BOFF",
"FOFF",
"EMP",
"FRONT",
],
"Mandadory": [
"Yes",
"Yes",
"Yes",
"No",
"Yes",
"Yes",
"Yes",
"Yes",
"Yes",
"No",
"No",
],
}
)
Here is one way to do it with Pandas merge and apply:
# Merge dataframes on relevant columns with filter on "Contract"
new_df = pd.merge(
left=df,
right=other.loc[other["Field"] == "Contract", ["Country", "Values", "Code"]],
how="left",
left_on=["Country", "Contract"],
right_on=["Country", "Values"],
)
# Update "Contract" column
new_df["Contract"] = new_df.apply(
lambda x: x["Code"] if not pd.isna(x["Code"]) else x["Contract"], axis=1
)
# Add new "Errors" column
new_df["Errors"] = new_df.apply(
lambda x: [] if not pd.isna(x["Code"]) else ["Contract"], axis=1
)
new_df = new_df.drop(columns=["Values", "Code"])
# Merge dataframes on relevant columns with filter on "Type"
new_df = pd.merge(
left=new_df,
right=other.loc[other["Field"] == "Type", ["Country", "Values", "Code"]],
how="left",
left_on=["Country", "Type"],
right_on=["Country", "Values"],
)
# Update "Type" column
new_df["Type"] = new_df.apply(
lambda x: x["Code"] if not pd.isna(x["Code"]) else x["Type"], axis=1
)
# Update "Errors" column
new_df["Errors"] = new_df.apply(
lambda x: x["Errors"] if not pd.isna(x["Code"]) else x["Errors"] + ["Type"], axis=1
)
new_df = new_df.drop(columns=["Values", "Code"])
Then:
print(new_df)
# Output
ID Country Contract Type Errors
0 1 CZ PE BOFF [Type]
1 1 ES Fixed-term . [Contract, Type]
2 2 CZ CT FOFF []
3 3 PT Permanent [Contract, Type]
4 4 PT 2022-01-01 EMP [Contract]
5 4 PT Fixed-term Office [Contract, Type]
6 4 ES Employee [Contract, Type]
7 5 SK PER-01 Employee [Type]