I have a dataframe that contains the information of different "contract types" in a company: Employee, Consultant, Non Employee.
Each row represents a contract and one person (meaning someone with same first name and last name) can have more than 1 contract, and the contract can be either one of the 3 values mentioned, and there is an "ID" column that identifies the "contract number" of the person. This is the dataframe:
FirstName LastName Employee Type ID
Paulo Cortez Employee 2
Paulo Cortez Employee 1
Paulo Cortez Consultant 1
Paulo Cortez Non Worker 1
Paulo Cortez Employee 3
Felipe Cardoso Employee np.nan
Vera Lucia Employee 2
Pedro Machado Consultant 1
Lucio Mario Employee 1
Lucio Mario Employee 1
Lucio Mario Consultant 1
Maria Hebbe Employee 1
Maria Hebbe Consultant 1
Maria Hebbe Consultant 1
The Logic I would need to validate is the following:
Per each person (same first name and last name) I would need to analyze how many times this person appears in this dataframe. Then, per each row, and per each "Employee Type", the "ID" column needs to be checked as following:
I need a way to filter this dataframe by fetching all person not following this logic (even if for some Employee type contracts it is correct, example being Lucio Mario that "Employee Type" consultant is correct but "Employee" is not), resulting on this dataframe:
FirstName LastName Employee Type ID
Felipe Cardoso Employee np.nan
Vera Lucia Employee 2
Lucio Mario Employee 1
Lucio Mario Employee 1
Lucio Mario Consultant 1
Maria Hebbe Employee 1
Maria Hebbe Consultant 1
Maria Hebbe Consultant 1
What would be the best way to achieve it?
With the dataframe you provided:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"FirstName": [
"Paulo",
"Paulo",
"Paulo",
"Paulo",
"Paulo",
"Felipe",
"Vera",
"Pedro",
"Lucio",
"Lucio",
"Lucio",
"Maria",
"Maria",
"Maria",
],
"LastName": [
"Cortez",
"Cortez",
"Cortez",
"Cortez",
"Cortez",
"Cardoso",
"Lucia",
"Machado",
"Mario",
"Mario",
"Mario",
"Hebbe",
"Hebbe",
"Hebbe",
],
"Employee Type": [
"Employee",
"Employee",
"Consultant",
"Non Worker",
"Employee",
"Employee",
"Employee",
"Consultant",
"Employee",
"Employee",
"Consultant",
"Employee",
"Consultant",
"Consultant",
],
"ID": [
"2",
"1",
"1",
"1",
"3",
np.nan,
"2",
"1",
"1",
"1",
"1",
"1",
"1",
"1",
],
}
)
Here is one way to it with Pandas groupby:
tmp = (
df.fillna(0)
.groupby(["FirstName", "LastName", "Employee Type"])
.agg(["count", list])
.droplevel(0, axis=1)
.reset_index()
.rename(columns={"list": "ID"})
.pipe(lambda df_: df_[df_["count"] != df_["ID"].apply(max).astype(int)])
)
new_df = df[
(df["FirstName"].isin(tmp["FirstName"])) & (df["LastName"].isin(tmp["LastName"]))
].reset_index(drop=True)
Then:
print(new_df)
# Output
FirstName LastName Employee Type ID
0 Felipe Cardoso Employee NaN
1 Vera Lucia Employee 2
2 Lucio Mario Employee 1
3 Lucio Mario Employee 1
4 Lucio Mario Consultant 1
5 Maria Hebbe Employee 1
6 Maria Hebbe Consultant 1
7 Maria Hebbe Consultant 1