pandas datetime pandas-groupby pivot-table hierarchical-clustering

Pandas groupby multi conditions and date difference calculation

I am stuck understanding the method to use. I have the following dataframe:

df = {'CODE': ['BBLGLC70M','BBLGLC70M','ZZTNRD77', 'ZZTNRD77', 'AACCBD', 'AACCBD', 'BCCDN', 'BCCDN', 'BCCDN'],
      'DATE': ['16/05/2019','25/09/2019', '16/03/2020', '27/02/2020', '16/07/2020', '21/07/2020', '13/02/2020', '23/07/2020', '27/02/2020'],
      'TYPE': ['PRI', 'PRI', 'PRI', 'PRI', 'PUB', 'PUB', 'PUB', 'PRI', 'PUB'],
      'DESC' : ['KO', 'OK', 'KO', 'KO', 'KO', 'OK', 'KO', 'OK', 'OK']
       }

df = pd.DataFrame(df)
df['DATE'] = pd.to_datetime(df['DATE'], format = '%d/%m/%Y')
df

I need to:

groupby the same 'CODE',
check if the 'DESC' is not the same
check if the 'TYPE' is the same
calculate the month difference between dates that satisfy the previous 2 commands

The expected output is the below:

Solution

The following code uses .drop_duplicates() and .duplicated() to keep or throw out rows from your dataframe that have duplicate values.

How would you calculate a month's difference? A month can be 28, 30 or 31 days. You could divide the end result by 30 and get an indication of the number of months difference. So I kept it in days for now.

import pandas as pd

df = {'CODE': ['BBLGLC70M','BBLGLC70M','ZZTNRD77', 'ZZTNRD77', 'AACCBD', 'AACCBD', 'BCCDN', 'BCCDN', 'BCCDN'],
      'DATE': ['16/05/2019','25/09/2019', '16/03/2020', '27/02/2020', '16/07/2020', '21/07/2020', '13/02/2020', '23/07/2020', '27/02/2020'],
      'TYPE': ['PRI', 'PRI', 'PRI', 'PRI', 'PUB', 'PUB', 'PUB', 'PRI', 'PUB'],
      'DESC' : ['KO', 'OK', 'KO', 'KO', 'KO', 'OK', 'KO', 'OK', 'OK']
       }

df = pd.DataFrame(df)
df['DATE'] = pd.to_datetime(df['DATE'], format = '%d/%m/%Y')

# only keep rows that have the same code and type 
df = df[df.duplicated(subset=['CODE', 'TYPE'], keep=False)]

# throw out rows that have the same code and desc
df = df.drop_duplicates(subset=['CODE', 'DESC'], keep=False)

# find previous date
df = df.sort_values(by=['CODE', 'DATE'])
df['previous_date'] = df.groupby('CODE')['DATE'].transform('shift')

# drop rows that don't have a previous date
df = df.dropna()

# calculate the difference between current date and previous date
df['difference_in_dates'] = (df['DATE'] - df['previous_date'])

This results in the following df:

CODE        DATE        TYPE    DESC    previous_date   difference_in_dates
AACCBD      2020-07-21  PUB     OK      2020-07-16      5 days
BBLGLC70M   2019-09-25  PRI     OK      2019-05-16      132 days
BCCDN       2020-02-27  PUB     OK      2020-02-13      14 days