
Pandas timespan and groups: Need to groupby/pivot with index as group id with columns that correspond to most recent period values


I have a table that looks like this:

   Index Group_Id   Period Start    Period End  Value   Value_Count
    42   1016833    2012-01-01   2013-01-01     127491.00     17.0
    43   1016833    2013-01-01   2014-01-01     48289.00      9.0
    44   1016833    2014-01-01   2015-01-01     2048.00       2.0
    45   1016926    2012-02-01   2013-02-01     913.00        1.0
    46   1016926    2013-02-01   2014-02-01     6084.00       5.0
    47   1016926    2014-02-01   2015-02-01     29942.00      3.0
    48   1016971    2014-03-01   2015-03-01     0.00          0.0

I am trying to end up with a 'wide' df where each Group_Id has one row, and the values/value counts become columns ordered by recency. So the end result would look like:

   Index Group_Id   Value_P0    Value_P1    Value_P2    Count_P0    Count_P1 ... 
    42   1016833    2048.00      48289.00   127491.00     2.0         9.0   
    45   1016926    29942.00     6084.00    913.00        3.0         5.0
    48   1016971    0.0          0.00       0.0           0.0         0.0

Where Value_P0 is the most recent value, Value_P1 is the next most recent value after that, and the Count columns work the same way.

I've tried pivoting the table so that Group_Id is the index, Period Start supplies the columns, and Value (or Value_Count) fills the cells:

Period Start    2006-07-01  2008-07-01  2009-02-01  2009-12-17  2010-02-01  2010-06-01  2010-07-01  2010-08-13  2010-09-01  2010-12-01  ... 2016-10-02  2016-10-20  2016-12-29  2017-01-05  2017-02-01  2017-03-28  2017-04-10  2017-05-14  2017-08-27  2017-09-15
Group_Id                                                                                    
1007310 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1007318 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1007353 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...

This gives me one row per Group_Id, but I would then need to loop through the many columns of each row and pull out the non-NaN values, whose order corresponds to oldest to newest. This seems like the wrong way to go about it, though.
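For what it's worth, that pivot-then-collapse idea can be done without an explicit loop: `apply` a per-row `dropna` across the sparse columns. This is only a sketch with hand-reconstructed sample data (the `Value_Count` column is left out for brevity):

```python
import pandas as pd

# Sample data reconstructed by hand from the question's table
df = pd.DataFrame({
    "Group_Id": [1016833, 1016833, 1016833,
                 1016926, 1016926, 1016926, 1016971],
    "Period Start": pd.to_datetime(
        ["2012-01-01", "2013-01-01", "2014-01-01",
         "2012-02-01", "2013-02-01", "2014-02-01", "2014-03-01"]),
    "Value": [127491.0, 48289.0, 2048.0, 913.0, 6084.0, 29942.0, 0.0],
})

# One very sparse column per distinct Period Start
wide = df.pivot(index="Group_Id", columns="Period Start", values="Value")

# Collapse each row to just its non-NaN values; pivot sorts the date
# columns oldest-to-newest, so reversing puts the newest value first.
packed = (wide.apply(lambda row: pd.Series(row.dropna().values[::-1]), axis=1)
              .add_prefix("Value_P")
              .fillna(0))
```

The downside is that `apply(..., axis=1)` is a Python-level loop in disguise, so the groupby/cumcount route in the solution below scales better.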

I've also considered grouping by Group_Id and somehow creating a timedelta relative to each group's most recent date, then pivoting/unstacking so that the columns are the timedeltas and the values are Value or Value_Count. I'm not sure how to do this, though. I appreciate the help.
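That "order by recency within each group" idea doesn't actually need a timedelta: ranking each group's `Period Start` in descending order gives an integer position where 0 is the most recent period. A minimal sketch, with one assumed sample group from the question:

```python
import pandas as pd

# One group from the question's table, reconstructed by hand
df = pd.DataFrame({
    "Group_Id": [1016833, 1016833, 1016833],
    "Period Start": pd.to_datetime(["2012-01-01", "2013-01-01", "2014-01-01"]),
    "Value": [127491.0, 48289.0, 2048.0],
})

# Rank periods within each group, newest first: 0 = most recent
df["ID"] = (
    df.groupby("Group_Id")["Period Start"]
      .rank(method="first", ascending=False)
      .astype(int) - 1
)

# Pivot the recency rank into columns
recency = df.pivot(index="Group_Id", columns="ID", values="Value").add_prefix("Value_P")
```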


Solution

  • Still using pivot

    df = df.sort_values('Period Start', ascending=False)  # newest first, so P0 is the most recent period
    df['ID'] = df.groupby('Group_Id').cumcount()
    d1 = df.pivot(index='Group_Id', columns='ID', values='Value').add_prefix('Value_P')
    d2 = df.pivot(index='Group_Id', columns='ID', values='Value_Count').add_prefix('Count_P')
    pd.concat([d1, d2], axis=1).fillna(0)
    
    Out[347]: 
    ID        Value_P0  Value_P1  Value_P2  Count_P0  Count_P1  Count_P2
    Group_Id                                                            
    1016833     2048.0   48289.0  127491.0       2.0       9.0      17.0
    1016926    29942.0    6084.0     913.0       3.0       5.0       1.0
    1016971        0.0       0.0       0.0       0.0       0.0       0.0
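On pandas ≥ 1.1 the two pivots can also be collapsed into one by passing a list of value columns; you then just have to flatten the resulting column MultiIndex yourself. A self-contained variant, with the question's sample data reconstructed by hand:

```python
import pandas as pd

# Sample data reconstructed from the question's table
df = pd.DataFrame({
    "Group_Id": [1016833, 1016833, 1016833,
                 1016926, 1016926, 1016926, 1016971],
    "Period Start": pd.to_datetime(
        ["2012-01-01", "2013-01-01", "2014-01-01",
         "2012-02-01", "2013-02-01", "2014-02-01", "2014-03-01"]),
    "Value": [127491.0, 48289.0, 2048.0, 913.0, 6084.0, 29942.0, 0.0],
    "Value_Count": [17.0, 9.0, 2.0, 1.0, 5.0, 3.0, 0.0],
})

# Newest period first, so position 0 means "most recent"
df = df.sort_values("Period Start", ascending=False)
df["ID"] = df.groupby("Group_Id").cumcount()

# One pivot over both value columns; fill groups with fewer periods with 0
out = df.pivot(index="Group_Id", columns="ID",
               values=["Value", "Value_Count"]).fillna(0)

# Flatten the (column name, position) MultiIndex into Value_P0, Count_P0, ...
out.columns = [
    f"{'Count' if name == 'Value_Count' else 'Value'}_P{pos}"
    for name, pos in out.columns
]
```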