Tags: python, pandas, dataframe, pandas-groupby, python-datetime

Pandas - Cross referencing with DatetimeIndex - Groupby


I have monthly (end-of-month) data for many companies. I want to create a new column, computed per company with groupby, where:

  • new_col from July of this year to June of next year takes the value of old_col from December of last year
  • For example, new_col from Jul-2000 to Jun-2001 equals old_col at Dec-1999
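
To make the rule concrete, here is a minimal sketch that maps any month-end date to the December whose old_col value it should receive. The function name `source_december` is hypothetical and only illustrates the date arithmetic; it does not touch the real data.

```python
import pandas as pd

def source_december(ts: pd.Timestamp) -> pd.Timestamp:
    """Return the December month-end whose old_col value feeds new_col at ts."""
    # Jul..Dec of year Y look back to Dec of Y-1;
    # Jan..Jun of year Y look back to Dec of Y-2.
    src_year = ts.year - 1 if ts.month >= 7 else ts.year - 2
    return pd.Timestamp(year=src_year, month=12, day=31)

print(source_december(pd.Timestamp("2000-07-31")))  # 1999-12-31 00:00:00
print(source_december(pd.Timestamp("2001-06-30")))  # 1999-12-31 00:00:00
print(source_december(pd.Timestamp("2001-07-31")))  # 2000-12-31 00:00:00
```

The first two calls reproduce the Jul-2000 to Jun-2001 example; the third shows the window rolling over to the next December.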

You can download sample data here: https://www.dropbox.com/s/oz1ltblh6u0chzt/tem_20220506.csv?dl=0

I have been trying the following code, without success:

import pandas as pd

df = pd.read_csv('tem_20220506.csv', parse_dates=['date'])
df.drop(columns=['new_col', 'Note'], inplace=True)
df = df.set_index('date').rename_axis(None)

df['new_col'] = df.groupby('comp').apply(lambda g: --- ) # ← I am now stuck here

Desired output:

           comp  old_col  new_col
2000-01-31    a        1      NaN   
2000-02-29    a        2      NaN   
2000-03-31    a        3      NaN   
2000-04-30    a        4      NaN   
2000-05-31    a        5      NaN   
2000-06-30    a        6      NaN   
2000-07-31    a        7      NaN   
2000-08-31    a        8      NaN   
2000-09-30    a        9      NaN   
2000-10-31    a       10      NaN   
2000-11-30    a       11      NaN   
2000-12-31    a       12      NaN   
2001-01-31    a       13      NaN   
2001-02-28    a       14      NaN   
2001-03-31    a       15      NaN   
2001-04-30    a       16      NaN   
2001-05-31    a       17      NaN   
2001-06-30    a       18      NaN   
2001-07-31    a       19   12.000   
2001-08-31    a       20   12.000   
2001-09-30    a       21   12.000   
2001-10-31    a       22   12.000   
2001-11-30    a       23   12.000   
2001-12-31    a       24   12.000   
2002-01-31    a       25   12.000   
2002-02-28    a       26   12.000   
2002-03-31    a       27   12.000   
2002-04-30    a       28   12.000   
2002-05-31    a       29   12.000   
2002-06-30    a       30   12.000   
2002-07-31    a       31   24.000   
2002-08-31    a       32   24.000   
2002-09-30    a       33   24.000   
2002-10-31    a       34   24.000   
2002-11-30    a       35   24.000   
2002-12-31    a       36   24.000   
2000-01-31    b      101      NaN   
2000-02-29    b      102      NaN   
2000-03-31    b      103      NaN   
2000-04-30    b      104      NaN   
2000-05-31    b      105      NaN   
2000-06-30    b      106      NaN   
2000-07-31    b      107      NaN   
2000-08-31    b      108      NaN   
2000-09-30    b      109      NaN   
2000-10-31    b      110      NaN   
2000-11-30    b      111      NaN   
2001-01-31    b      113      NaN   
2001-02-28    b      114      NaN   
2001-03-31    b      115      NaN   
2001-04-30    b      116      NaN   
2001-05-31    b      117      NaN   
2001-06-30    b      118      NaN   
2001-07-31    b      119      NaN   
2001-08-31    b      120      NaN   
2001-09-30    b      121      NaN   
2001-10-31    b      122      NaN   
2001-11-30    b      123      NaN   
2001-12-31    b      124      NaN   
2002-01-31    b      125      NaN   
2002-02-28    b      126      NaN   
2002-03-31    b      127      NaN   
2002-04-30    b      128      NaN   
2002-05-31    b      129      NaN   
2002-06-30    b      130      NaN   
2002-07-31    b      131  124.000   
2002-08-31    b      132  124.000   
2002-10-31    b      134  124.000   
2002-11-30    b      135  124.000   
2002-12-31    b      136  124.000   

(!!) Note that for comp == b:

  • new_col is NaN from Jul-2001 to Jun-2002 because the Dec-2000 value is missing

  • The Sep-2002 row is missing, but that is OK


Solution

  • import pandas as pd

    df = pd.read_csv('tem_20220506.csv', parse_dates=['date'])
    df.drop(columns=['new_col', 'Note'], inplace=True)
    df.set_index('date', inplace=True)
    

    Use a helper function to derive new_col from old_col:

    def helper_func(x):
        # work on a copy and start from NaN, so a company with no December row still gets the column
        x = x.copy()
        x['new_col'] = float('nan')

        # December rows of this company, as {date: old_col value}
        req_values = x.loc[x.index.month == 12, 'old_col'].to_dict()
    
        # each December value fills July of the following year through June of the year after
        for date_value, old_col_value in req_values.items():
            x.loc[f'{date_value.year+1}-07-31':f'{date_value.year+2}-06-30', 'new_col'] = old_col_value
            
        return x
    
    # group_keys=False keeps the plain date index so the result lines up with df
    # (assumes df is sorted by comp, then date, as in the sample file)
    df['new_col'] = df.groupby('comp', group_keys=False)[['old_col']].apply(helper_func)['new_col']
    

    This produces the DataFrame shown in your desired output.

    An alternative helper function, which can be passed to apply in the same way:

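    def helper_fun2(x):
        """
        1. iterate over the years present in the group
        2. if that year's December row exists, copy its old_col value into
           new_col from July of the next year through June of the year after
        """
        # work on a copy and start from NaN so the column always exists
        x = x.copy()
        x['new_col'] = float('nan')
        for year in x.index.year.unique():
            if f'{year}-12-31' in x.index:
                x.loc[f'{year+1}-07-31':f'{year+2}-06-30', 'new_col'] = x.loc[f'{year}-12-31', 'old_col']
            
        return x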
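
    If the loop-based helpers become slow on many companies, the same rule can probably be expressed without apply. The following is a sketch of a different technique (a left merge on a computed "source year" key), not the method used above. It assumes `date` is still a regular column (before `set_index`), that `new_col` has not been created yet, and that each company has at most one row per month. The names `add_new_col` and `src_year` are hypothetical, and the sample frame is rebuilt by hand to mirror the question rather than read from the CSV.

```python
import numpy as np
import pandas as pd

def add_new_col(df: pd.DataFrame) -> pd.DataFrame:
    """Attach new_col by merging each row with its source December row."""
    out = df.copy()
    dates = pd.to_datetime(out["date"])
    # Year of the December that feeds each row:
    # Jul..Dec of Y use Dec of Y-1, Jan..Jun of Y use Dec of Y-2.
    out["src_year"] = np.where(dates.dt.month >= 7, dates.dt.year - 1, dates.dt.year - 2)
    # One row per (comp, year) holding that year's December value.
    is_dec = dates.dt.month == 12
    dec = out.loc[is_dec, ["comp", "old_col"]].rename(columns={"old_col": "new_col"})
    dec["src_year"] = dates[is_dec].dt.year
    # A left merge keeps every original row; a missing December yields NaN.
    out = out.merge(dec, on=["comp", "src_year"], how="left")
    return out.drop(columns="src_year")

# Hand-built sample mirroring the question: 36 month-ends for two companies.
month_ends = pd.date_range("2000-01-01", periods=36, freq="MS") + pd.offsets.MonthEnd(0)
a = pd.DataFrame({"date": month_ends, "comp": "a", "old_col": range(1, 37)})
b = pd.DataFrame({"date": month_ends, "comp": "b", "old_col": range(101, 137)})
# Company b lacks Dec-2000 and Sep-2002, as in the question.
b = b[~b["date"].isin(pd.to_datetime(["2000-12-31", "2002-09-30"]))]
sample = pd.concat([a, b], ignore_index=True)

result = add_new_col(sample)
print(result[result["new_col"].notna()].groupby("comp")["new_col"].unique())
```

    The merge leaves rows in their original order, so `result.set_index('date')` afterwards gives the same layout as the desired output.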