How to detect automatically the error in DataFrame Pandas

I have a dataframe like this :

id = ['A','A','A','A']
date = ['2022-01-01','2022-01-02','2022-01-03','2022-01-04']
serial_on=['A','C','D','D']
serial_off=['B','A','C','Z']
df= pd.DataFrame({'id':id,'date':date,'serial_on':serial_on,'serial_off':serial_off})

It is about removal-installation of an equipment.

For example, on 2022-01-01, I removed the equipment with serial number B and I installed the equipement with serial number A.

Normally, If one is installed, it should be removed before installing new equipment.

However, as you see, serial number D is reinstalled without removal.

I want to detect thess serial number.

From How to divide by two groups (Python Pandas DataFrame)?

I think we can approch by graph method, but I have no idea how to apply it automatically to find a serial number having problem.

Solution

So, if I understand what you want to do, you want to identify whenever am istallation has be done without removal. So, in that case you will have two instances of a date on the installation.

So given your data:


id = ['A','A','A','A']
date = ['2022-01-01','2022-01-02','2022-01-03','2022-01-04']
serial_on=['A','C','D','D']
serial_off=['B','A','C','Z']
df= pd.DataFrame({'id':id,'date':date,'serial_on':serial_on,'serial_off':serial_off})
print(df)

which is

  id        date serial_on serial_off
0  A  2022-01-01         A          B
1  A  2022-01-02         C          A
2  A  2022-01-03         D          C
3  A  2022-01-04         D          Z

you want to group by the dates using a network analyze type of technique. For this you cn do this:

serials = pd.concat([df['serial_on'], df['serial_off']]).unique()
events = []
for s in serials:
    on_dates = list(df.loc[df['serial_on'] == s, 'date'])
    off_dates = list(df.loc[df['serial_off'] == s, 'date'])
    events.append({'serial': s, 'on_dates': on_dates, 'off_dates': off_dates})

events_df = pd.DataFrame(events)

which returns:

 serial                  on_dates     off_dates
0      A              [2022-01-01]  [2022-01-02]
1      C              [2022-01-02]  [2022-01-03]
2      D  [2022-01-03, 2022-01-04]            []
3      B                        []  [2022-01-01]
4      Z                        []  [2022-01-04]

As you can see, for D you have on dates (installation dates) without removal of D prior to its installation.