Search code examples
pythonpandasdataframegraph

How to detect automatically the error in DataFrame Pandas


I have a dataframe like this :

id = ['A','A','A','A']
date = ['2022-01-01','2022-01-02','2022-01-03','2022-01-04']
serial_on=['A','C','D','D']
serial_off=['B','A','C','Z']
df= pd.DataFrame({'id':id,'date':date,'serial_on':serial_on,'serial_off':serial_off})

enter image description here

It is about removal-installation of an equipment.

For example, on 2022-01-01, I removed the equipment with serial number B and I installed the equipement with serial number A.

Normally, If one is installed, it should be removed before installing new equipment.

However, as you see, serial number D is reinstalled without removal.

I want to detect thess serial number.

From How to divide by two groups (Python Pandas DataFrame)?

I think we can approch by graph method, but I have no idea how to apply it automatically to find a serial number having problem.


Solution

  • So, if I understand what you want to do, you want to identify whenever am istallation has be done without removal. So, in that case you will have two instances of a date on the installation.

    So given your data:

    
    id = ['A','A','A','A']
    date = ['2022-01-01','2022-01-02','2022-01-03','2022-01-04']
    serial_on=['A','C','D','D']
    serial_off=['B','A','C','Z']
    df= pd.DataFrame({'id':id,'date':date,'serial_on':serial_on,'serial_off':serial_off})
    print(df)
    

    which is

      id        date serial_on serial_off
    0  A  2022-01-01         A          B
    1  A  2022-01-02         C          A
    2  A  2022-01-03         D          C
    3  A  2022-01-04         D          Z
    

    you want to group by the dates using a network analyze type of technique. For this you cn do this:

    serials = pd.concat([df['serial_on'], df['serial_off']]).unique()
    events = []
    for s in serials:
        on_dates = list(df.loc[df['serial_on'] == s, 'date'])
        off_dates = list(df.loc[df['serial_off'] == s, 'date'])
        events.append({'serial': s, 'on_dates': on_dates, 'off_dates': off_dates})
    
    events_df = pd.DataFrame(events)
    

    which returns:

     serial                  on_dates     off_dates
    0      A              [2022-01-01]  [2022-01-02]
    1      C              [2022-01-02]  [2022-01-03]
    2      D  [2022-01-03, 2022-01-04]            []
    3      B                        []  [2022-01-01]
    4      Z                        []  [2022-01-04]
    

    As you can see, for D you have on dates (installation dates) without removal of D prior to its installation.