I have the following dataframe of employees, represented by "PERSON_NUMBER", the benefits they have in "ELEMENT_NAME" and until when it is valid:
ELEMENT_NAME Element_Start_Date Element_End_Date
PERSON_NUMBER
3720081000 Standard Bonus M 2017-09-30 2025-12-31
3720081000 IE Healthcare 2016-12-01 2025-12-31
3720081000 IE Pensions 2016-09-01 2019-12-31
3720081000 IE Pensions 2019-09-01 2025-12-31
3720081000 IE Pensions 2020-03-01 2025-12-31
3720082294 IE Car Allowance 2020-03-09 2025-12-31
3720082295 Standard Bonus K 2020-03-23 2025-12-31
3720082305 IE Pensions 2020-05-25 2025-12-31
3720082305 IE Pensions 2017-05-25 2019-12-31
3720082395 Standard Bonus J 2020-03-23 2020-04-31
3720082395 Standard Bonus J 2020-05-25 2020-12-31
3720082395 Standard Bonus J 2020-09-25 2025-12-31
A person can have more than 1 element in their record but, for each unique element, we cannot have this element, at any point in time, duplicated during a certain period of time.
CASE 1:
So, for example, in this dataframe, the employee 3720081000 have the 2 first bonus ok but for "IE PENSIONS" you could see that the first time it received the "IE PENSIONS" the end date was "2019-12-31" but the following row have the same element starting from "2019-09-01" meaning from "September to December 2019" this element was in his record twice. Then, again, this element appears open in a 3rd record.
CASE 2:
But the employee 3720082305 for example would be ok as although they have the "IE Pensions" twice, start date and end date do not colide.
CASE 3:
For the employee 3720082295 you can see that they have 3 records of "Standard Bonus J". The first and second record would be ok as the benefit start from Match to April and then the other one start end of May and ends on 2025. But then, a 3rd record is created on September 2020 until 2025 and conflicts with the previous record as the person had this benefit already open.
I would expect to get all rows that then conflicts with each other.
SO, EXPECTED OUTPUT FROM THIS DATA WOULD BE:
ELEMENT_NAME Element_Start_Date Element_End_Date
PERSON_NUMBER
3720081000 IE Pensions 2016-09-01 2019-12-31
3720081000 IE Pensions 2019-09-01 2025-12-31
3720081000 IE Pensions 2020-03-01 2025-12-31
3720082395 Standard Bonus J 2020-05-25 2020-12-31
3720082395 Standard Bonus J 2020-09-25 2025-12-31
So, if an employee number present any error in any of the elements they have, I want to return all the rows of that element by "PERSON NUMBER"
What would be the best way to build this verification?
Let's start from a small correction in your sample data: One of values in Element_End_Date is 2020-04-31, which is not any valid date (April has only 30 days), so I changed it to 2020-04-30.
I also assume that:
A preparatory step to do your task, is to define a function to get rows with overlapping date ranges, for each group of rows:
def getOverlapping(grp):
ind = pd.IntervalIndex.from_arrays(grp.Element_Start_Date, grp.Element_End_Date)
ovl = [ind.overlaps(x).sum() > 1 for x in ind]
return grp[ovl]
And to get the result, apply it:
df.set_index('ELEMENT_NAME', append=True).groupby(level=[0,1])\
.apply(getOverlapping).reset_index(level=[2, 3], drop=True).reset_index(level=1)
The result is:
ELEMENT_NAME Element_Start_Date Element_End_Date
PERSON_NUMBER
3720081000 IE Pensions 2016-09-01 2019-12-31
3720081000 IE Pensions 2019-09-01 2025-12-31
3720081000 IE Pensions 2020-03-01 2025-12-31
3720082395 Standard Bonus J 2020-05-25 2020-12-31
3720082395 Standard Bonus J 2020-09-25 2025-12-31