Search code examples
pythonpandasdataframefilterlambda

How to confirm that sequences of ids in a column comply to a given rule?


I have a dataframe that contains the information of different "contract types" in a company: Employee, Consultant, Non Employee.

Each row represents a contract and one person (meaning someone with same first name and last name) can have more than 1 contract, and the contract can be either one of the 3 values mentioned, and there is an "ID" column that identifies the "contract number" of the person. This is the dataframe:

FirstName   LastName    Employee Type   ID
Paulo       Cortez      Employee         2
Paulo       Cortez      Employee         1
Paulo       Cortez      Consultant       1
Paulo       Cortez      Non Worker       1
Paulo       Cortez      Employee         3
Felipe      Cardoso     Employee         np.nan
Vera        Lucia       Employee         2
Pedro       Machado     Consultant       1
Lucio       Mario       Employee         1
Lucio       Mario       Employee         1
Lucio       Mario       Consultant       1
Maria       Hebbe       Employee         1
Maria       Hebbe       Consultant       1
Maria       Hebbe       Consultant       1

The Logic I would need to validate is the following:

  • Per each person (same first name and last name) I would need to analyze how many times this person appears in this dataframe. Then, per each row, and per each "Employee Type", the "ID" column needs to be checked as following:

    • The ID will be 1 for each first "Employee Type" for a person, meaning if the person appears 3 times in the data one for each "Employee Type" (Employee, Consultant, Non Worker), the "ID" column would be 1 (here does not matter where the record is in the table). Then, per each subsequential contract of the same type, the ID should add 1.

I need a way to filter this dataframe by fetching all person not following this logic (even if for some Employee type contracts it is correct, example being Lucio Mario that "Employee Type" consultant is correct but "Employee" is not), resulting on this dataframe:

FirstName   LastName    Employee Type   ID
Felipe      Cardoso     Employee         np.nan
Vera        Lucia       Employee         2
Lucio       Mario       Employee         1
Lucio       Mario       Employee         1
Lucio       Mario       Consultant       1
Maria       Hebbe       Employee         1
Maria       Hebbe       Consultant       1
Maria       Hebbe       Consultant       1

What would be the best way to achieve it?


Solution

  • With the dataframe you provided:

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame(
        {
            "FirstName": [
                "Paulo",
                "Paulo",
                "Paulo",
                "Paulo",
                "Paulo",
                "Felipe",
                "Vera",
                "Pedro",
                "Lucio",
                "Lucio",
                "Lucio",
                "Maria",
                "Maria",
                "Maria",
            ],
            "LastName": [
                "Cortez",
                "Cortez",
                "Cortez",
                "Cortez",
                "Cortez",
                "Cardoso",
                "Lucia",
                "Machado",
                "Mario",
                "Mario",
                "Mario",
                "Hebbe",
                "Hebbe",
                "Hebbe",
            ],
            "Employee Type": [
                "Employee",
                "Employee",
                "Consultant",
                "Non Worker",
                "Employee",
                "Employee",
                "Employee",
                "Consultant",
                "Employee",
                "Employee",
                "Consultant",
                "Employee",
                "Consultant",
                "Consultant",
            ],
            "ID": [
                "2",
                "1",
                "1",
                "1",
                "3",
                np.nan,
                "2",
                "1",
                "1",
                "1",
                "1",
                "1",
                "1",
                "1",
            ],
        }
    )
    

    Here is one way to it with Pandas groupby:

    tmp = (
        df.fillna(0)
        .groupby(["FirstName", "LastName", "Employee Type"])
        .agg(["count", list])
        .droplevel(0, axis=1)
        .reset_index()
        .rename(columns={"list": "ID"})
        .pipe(lambda df_: df_[df_["count"] != df_["ID"].apply(max).astype(int)])
    )
    
    
    new_df = df[
        (df["FirstName"].isin(tmp["FirstName"])) & (df["LastName"].isin(tmp["LastName"]))
    ].reset_index(drop=True)
    

    Then:

    print(new_df)
    # Output
      FirstName LastName Employee Type   ID
    0    Felipe  Cardoso      Employee  NaN
    1      Vera    Lucia      Employee    2
    2     Lucio    Mario      Employee    1
    3     Lucio    Mario      Employee    1
    4     Lucio    Mario    Consultant    1
    5     Maria    Hebbe      Employee    1
    6     Maria    Hebbe    Consultant    1
    7     Maria    Hebbe    Consultant    1