Search code examples
pandasdatetimepandas-groupbypivot-tablehierarchical-clustering

Pandas groupby multi conditions and date difference calculation


I am stuck understanding the method to use. I have the following dataframe:

df = {'CODE': ['BBLGLC70M','BBLGLC70M','ZZTNRD77', 'ZZTNRD77', 'AACCBD', 'AACCBD', 'BCCDN', 'BCCDN', 'BCCDN'],
      'DATE': ['16/05/2019','25/09/2019', '16/03/2020', '27/02/2020', '16/07/2020', '21/07/2020', '13/02/2020', '23/07/2020', '27/02/2020'],
      'TYPE': ['PRI', 'PRI', 'PRI', 'PRI', 'PUB', 'PUB', 'PUB', 'PRI', 'PUB'],
      'DESC' : ['KO', 'OK', 'KO', 'KO', 'KO', 'OK', 'KO', 'OK', 'OK']
       }

df = pd.DataFrame(df)
df['DATE'] = pd.to_datetime(df['DATE'], format = '%d/%m/%Y')
df

enter image description here

I need to:

  1. groupby the same 'CODE',
  2. check if the 'DESC' is not the same
  3. check if the 'TYPE' is the same
  4. calculate the month difference between dates that satisfy the previous 2 commands

The expected output is the below:

enter image description here


Solution

  • The following code uses .drop_duplicates() and .duplicated() to keep or throw out rows from your dataframe that have duplicate values.

    How would you calculate a month's difference? A month can be 28, 30 or 31 days. You could divide the end result by 30 and get an indication of the number of months difference. So I kept it in days for now.

    import pandas as pd
    
    df = {'CODE': ['BBLGLC70M','BBLGLC70M','ZZTNRD77', 'ZZTNRD77', 'AACCBD', 'AACCBD', 'BCCDN', 'BCCDN', 'BCCDN'],
          'DATE': ['16/05/2019','25/09/2019', '16/03/2020', '27/02/2020', '16/07/2020', '21/07/2020', '13/02/2020', '23/07/2020', '27/02/2020'],
          'TYPE': ['PRI', 'PRI', 'PRI', 'PRI', 'PUB', 'PUB', 'PUB', 'PRI', 'PUB'],
          'DESC' : ['KO', 'OK', 'KO', 'KO', 'KO', 'OK', 'KO', 'OK', 'OK']
           }
    
    df = pd.DataFrame(df)
    df['DATE'] = pd.to_datetime(df['DATE'], format = '%d/%m/%Y')
    
    # only keep rows that have the same code and type 
    df = df[df.duplicated(subset=['CODE', 'TYPE'], keep=False)]
    
    # throw out rows that have the same code and desc
    df = df.drop_duplicates(subset=['CODE', 'DESC'], keep=False)
    
    # find previous date
    df = df.sort_values(by=['CODE', 'DATE'])
    df['previous_date'] = df.groupby('CODE')['DATE'].transform('shift')
    
    # drop rows that don't have a previous date
    df = df.dropna()
    
    # calculate the difference between current date and previous date
    df['difference_in_dates'] = (df['DATE'] - df['previous_date'])
    

    This results in the following df:

    CODE        DATE        TYPE    DESC    previous_date   difference_in_dates
    AACCBD      2020-07-21  PUB     OK      2020-07-16      5 days
    BBLGLC70M   2019-09-25  PRI     OK      2019-05-16      132 days
    BCCDN       2020-02-27  PUB     OK      2020-02-13      14 days