Search code examples
pythonpandasdatefiltermask

Validate dataframe duplicates using start and end date


I have the following dataframe of employees, represented by "PERSON_NUMBER", the benefits they have in "ELEMENT_NAME" and until when it is valid:

                       ELEMENT_NAME Element_Start_Date Element_End_Date
PERSON_NUMBER                                                          
3720081000         Standard Bonus M         2017-09-30       2025-12-31
3720081000            IE Healthcare         2016-12-01       2025-12-31
3720081000              IE Pensions         2016-09-01       2019-12-31
3720081000              IE Pensions         2019-09-01       2025-12-31
3720081000              IE Pensions         2020-03-01       2025-12-31
3720082294         IE Car Allowance         2020-03-09       2025-12-31
3720082295         Standard Bonus K         2020-03-23       2025-12-31
3720082305              IE Pensions         2020-05-25       2025-12-31
3720082305              IE Pensions         2017-05-25       2019-12-31
3720082395         Standard Bonus J         2020-03-23       2020-04-31
3720082395         Standard Bonus J         2020-05-25       2020-12-31
3720082395         Standard Bonus J         2020-09-25       2025-12-31

A person can have more than 1 element in their record but, for each unique element, we cannot have this element, at any point in time, duplicated during a certain period of time.

CASE 1:

So, for example, in this dataframe, the employee 3720081000 have the 2 first bonus ok but for "IE PENSIONS" you could see that the first time it received the "IE PENSIONS" the end date was "2019-12-31" but the following row have the same element starting from "2019-09-01" meaning from "September to December 2019" this element was in his record twice. Then, again, this element appears open in a 3rd record.

CASE 2:

But the employee 3720082305 for example would be ok as although they have the "IE Pensions" twice, start date and end date do not colide.

CASE 3:

For the employee 3720082295 you can see that they have 3 records of "Standard Bonus J". The first and second record would be ok as the benefit start from Match to April and then the other one start end of May and ends on 2025. But then, a 3rd record is created on September 2020 until 2025 and conflicts with the previous record as the person had this benefit already open.

I would expect to get all rows that then conflicts with each other.

SO, EXPECTED OUTPUT FROM THIS DATA WOULD BE:

                       ELEMENT_NAME Element_Start_Date Element_End_Date
PERSON_NUMBER                                                          
3720081000              IE Pensions         2016-09-01       2019-12-31
3720081000              IE Pensions         2019-09-01       2025-12-31
3720081000              IE Pensions         2020-03-01       2025-12-31
3720082395         Standard Bonus J         2020-05-25       2020-12-31
3720082395         Standard Bonus J         2020-09-25       2025-12-31

So, if an employee number present any error in any of the elements they have, I want to return all the rows of that element by "PERSON NUMBER"

What would be the best way to build this verification?


Solution

  • Let's start from a small correction in your sample data: One of values in Element_End_Date is 2020-04-31, which is not any valid date (April has only 30 days), so I changed it to 2020-04-30.

    I also assume that:

    • Both Element_Start_Date and Element_End_Date are of datetime type (not string).
    • PERSON_NUMBER is the index column, as in your sample.

    A preparatory step to do your task, is to define a function to get rows with overlapping date ranges, for each group of rows:

    def getOverlapping(grp):
        ind = pd.IntervalIndex.from_arrays(grp.Element_Start_Date, grp.Element_End_Date)
        ovl = [ind.overlaps(x).sum() > 1 for x in ind]
        return grp[ovl]
    

    And to get the result, apply it:

    df.set_index('ELEMENT_NAME', append=True).groupby(level=[0,1])\
        .apply(getOverlapping).reset_index(level=[2, 3], drop=True).reset_index(level=1)
    

    The result is:

                       ELEMENT_NAME Element_Start_Date Element_End_Date
    PERSON_NUMBER                                                      
    3720081000          IE Pensions         2016-09-01       2019-12-31
    3720081000          IE Pensions         2019-09-01       2025-12-31
    3720081000          IE Pensions         2020-03-01       2025-12-31
    3720082395     Standard Bonus J         2020-05-25       2020-12-31
    3720082395     Standard Bonus J         2020-09-25       2025-12-31