Search code examples
python-3.xpandasdataframepandas-groupbydata-preprocessing

Changing Shape and Structure of Data


I have the following dataset

Office Employee ID Joining Date Attrition Date
AA 700237 27-11-2017
AA 700238 11-01-2018
AA 700252 14-02-2018 08-04-2018
AB 700287 18-01-2014
AB 700449 28-02-2014 17-04-2014

The idea is to add in Active Column if somebody joins and deduct if somebody resigns in any month and therefore want to change it in the following format using python

Office Month & Year Active
AA Jan-17 0
AA Feb-17 0
AA Mar-17 0
AA Apr-17 0
AA May-17 0
AA Jun-17 0
AA Jul-17 0
AA Aug-17 0
AA Sep-17 0
AA Oct-17 0
AA Nov-17 1
AA Dec-17 1
AA Jan-18 2
AA Feb-18 3
AA Mar-18 3
AA Apr-18 2
AB Jan-14 1
AB Feb-14 2
AB Mar-14 2
AB Apr-14 1

Please help.


Solution

  • Use:

    #convert columns to datetimes
    df['Joining Date'] = pd.to_datetime(df['Joining Date'], dayfirst=True)
    df['Attrition Date'] = pd.to_datetime(df['Attrition Date'], dayfirst=True)
    
    #add new rows by first January of minimal year per groups
    df1 = df.groupby('Office')['Joining Date'].min() - pd.offsets.DateOffset(month=1, day=1)
    df = df.append(df1.reset_index()).sort_values(['Office','Joining Date'])
    
    
    #replace missing values in Attrition Date by maximal date with next month
    #replace missing values in Joining  Date by maximal date with next month
    next_month = (df.groupby('Office')['Attrition Date'].transform('max') + 
                   pd.offsets.DateOffset(months=1))
    next_month1 = (df.groupby('Office')['Joining Date'].transform('max') + 
                   pd.offsets.DateOffset(months=1))
    
    df['Attrition Date'] = df['Attrition Date'].fillna(next_month).fillna(next_month1)
    
    
    #explode start and end datetimes converted to months with years
    f = lambda x: pd.date_range(x['Joining Date'], 
                                x['Attrition Date'], freq='M').strftime('%b-%y')
    df['Month & Year'] = df.apply(f, axis=1)
    
    #count number of Employee ID with omit missing values
    df = (df.explode('Month & Year')
            .groupby(['Office','Month & Year'], sort=False)['Employee ID']
            .count()
            .reset_index(name='Active'))
    

    print (df)
       Office Month & Year  Active
    0      AA       Jan-17       0
    1      AA       Feb-17       0
    2      AA       Mar-17       0
    3      AA       Apr-17       0
    4      AA       May-17       0
    5      AA       Jun-17       0
    6      AA       Jul-17       0
    7      AA       Aug-17       0
    8      AA       Sep-17       0
    9      AA       Oct-17       0
    10     AA       Nov-17       1
    11     AA       Dec-17       1
    12     AA       Jan-18       2
    13     AA       Feb-18       3
    14     AA       Mar-18       3
    15     AA       Apr-18       2
    16     AB       Jan-14       1
    17     AB       Feb-14       2
    18     AB       Mar-14       2
    19     AB       Apr-14       1