Pandas pivot + date slicing: group by periods of time with partial overlap


I am trying to find a way to 'pivot' my pandas dataframe while keeping my index split by date ranges. The end goal is to create, for each index, a set of date ranges in which every attribute is matched with its value.

I reached the expected output using for loops and other non-vectorized approaches, but I am looking for a vectorized solution, since my input dataframe might be quite big.

I am using python 3.11 and pandas>=2.0.0.

Here is an input example :

    index   attribute   start_date  end_date    value
0   index_1 attribute_1 2022-01-01  2022-02-01  1
1   index_1 attribute_1 2022-02-01  2023-01-01  2
2   index_1 attribute_2 2022-01-01  2023-01-01  3
3   index_2 attribute_3 2022-01-01  2023-01-01  4
4   index_3 attribute_4 2022-01-01  2023-01-01  5

What I am trying to obtain is this :

    index   start_date  end_date    attribute_1 attribute_2 attribute_3 attribute_4
0   index_1 2022-01-01  2022-02-01  1           3           None        None
1   index_1 2022-02-01  2023-01-01  2           3           None        None
2   index_2 2022-01-01  2023-01-01  None        None        4           None
3   index_3 2022-01-01  2023-01-01  None        None        None        5

Here is a dictionary to reproduce the input dataframe :

from datetime import datetime

data = {
    "index": ["index_1", "index_1", "index_1", "index_2", "index_3"],
    "attribute": ["attribute_1", "attribute_1", "attribute_2", "attribute_3", "attribute_4"],
    "start_date": [datetime(2022, 1, 1), datetime(2022, 2, 1), datetime(2022, 1, 1), datetime(2022, 1, 1), datetime(2022, 1, 1)],
    "end_date": [datetime(2022, 2, 1), datetime(2023, 1, 1), datetime(2023, 1, 1), datetime(2023, 1, 1), datetime(2023, 1, 1)],
    "value": [1, 2, 3, 4, 5]
}

Solution

  • To begin with, I had to make two modifications to your input data for proper processing:

    Actual input used

    1) Poor practice: "index" assigned as var name

    Avoid using index as a variable name or, as in your case, as a column name: it can conflict with df.index later on. To avoid confusion, I have renamed your 'index' column in upper case: 'INDEX'.

    2) Dates overlap: end of period == beginning of next period

    For example, according to your input, index_1's attribute_1 takes two values on 2022-02-01: both 1 and 2 at the same time. That is bound to cause unnecessary problems. Hence I have made each end_date an end of month instead of the first day of the next month, as in "subtract one day from a pandas dataframe date column":

    # Removing one day from end_dates
    df['end_date'] = df['end_date'] - pd.tseries.offsets.Day()
    

    Alternatively you could:

    • use months as unit of time, provided this assumption holds for your full input data.
    • or specify an hour.
    Following these remarks, this is the dataframe used as input:
         INDEX    attribute start_date   end_date  value
    0  INDEX_1  attribute_1 2022-01-01 2022-01-31      1
    1  INDEX_1  attribute_1 2022-02-01 2022-12-31      2
    2  INDEX_1  attribute_2 2022-01-01 2022-12-31      3
    3  INDEX_2  attribute_3 2022-01-01 2022-12-31      4
    4  INDEX_3  attribute_4 2022-01-01 2022-12-31      5
    

    Please confirm that those changes are ok with you. Now moving on to the actual solution:


    Strategy

    1. Pivot attributes from index to columns

    2. Reshape dataframe into time series, as in "Convert dataframe with start and end date to daily data", in 2 steps:

      2a. Merge the tuples (start_date, end_date) into time spans,

      2b. Then explode the time spans into full daily time series,

    3. Aggregate the time series into the requested form

      3a. Merge data corresponding to each day: group by INDEX + days

      3b. Then recreate the relevant start and end dates: group by INDEX + attributes, aggregating timespans by min-max days, as in "Max and Min date in pandas groupby"
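
The strategy above can be sketched end to end as follows. This is only a minimal sketch under a few assumptions: the function name `pivot_by_period` and the intermediate column name `day` are mine (hypothetical), the end dates have already been shifted back by one day, and no combination of attributes repeats over separate time spans.

```python
import pandas as pd

# Input as used in this answer: 'index' renamed to 'INDEX', end dates shifted back one day.
df = pd.DataFrame({
    "INDEX": ["INDEX_1", "INDEX_1", "INDEX_1", "INDEX_2", "INDEX_3"],
    "attribute": ["attribute_1", "attribute_1", "attribute_2", "attribute_3", "attribute_4"],
    "start_date": pd.to_datetime(["2022-01-01", "2022-02-01", "2022-01-01", "2022-01-01", "2022-01-01"]),
    "end_date": pd.to_datetime(["2022-01-31", "2022-12-31", "2022-12-31", "2022-12-31", "2022-12-31"]),
    "value": [1, 2, 3, 4, 5],
})


def pivot_by_period(df):
    """Hypothetical helper chaining steps 1 to 3b of the strategy."""
    keys = ["INDEX", "start_date", "end_date"]

    # Step 1: pivot attributes from rows to columns.
    wide = df.pivot_table(values="value", columns="attribute", index=keys).reset_index()
    wide.columns.name = None
    attrs = [c for c in wide.columns if c not in keys]

    # Step 2a: one DatetimeIndex of days per row; step 2b: one row per day.
    wide["day"] = wide.apply(lambda row: pd.date_range(row["start_date"], row["end_date"]), axis=1)
    daily = wide.explode("day", ignore_index=True).drop(columns=["start_date", "end_date"])
    daily["day"] = pd.to_datetime(daily["day"])

    # Step 3a: merge all attributes known for a given INDEX on a given day.
    daily = daily.groupby(["INDEX", "day"], as_index=False)[attrs].max()

    # Step 3b: rebuild the date ranges for each combination of attribute values.
    out = (daily.groupby(["INDEX"] + attrs, dropna=False)["day"]
                .agg(start_date="min", end_date="max")
                .reset_index())
    return out[keys + attrs]


result = pivot_by_period(df)
print(result)
```

Each step is detailed, with its intermediate output, in the Processing section.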


    Processing

    1. Pivot attributes from index to columns
    df_pivot = df.pivot_table(values  ='value',
                              columns = 'attribute',
                              index   = ['INDEX','start_date','end_date']).reset_index()
    
    attribute    INDEX start_date   end_date  attribute_1  attribute_2  attribute_3  attribute_4
    0          INDEX_1 2022-01-01 2022-01-31          1.0          NaN          NaN          NaN
    1          INDEX_1 2022-01-01 2022-12-31          NaN          3.0          NaN          NaN
    2          INDEX_1 2022-02-01 2022-12-31          2.0          NaN          NaN          NaN
    3          INDEX_2 2022-01-01 2022-12-31          NaN          NaN          4.0          NaN
    4          INDEX_3 2022-01-01 2022-12-31          NaN          NaN          NaN          5.0
    
    2. Reshape dataframe into time series
    df_pivot['timespan'] = df_pivot.apply(lambda row: pd.date_range(row['start_date'], row['end_date']), axis=1)
    
    attribute    INDEX start_date   end_date  attribute_1  attribute_2  attribute_3  attribute_4                                           timespan
    0          INDEX_1 2022-01-01 2022-01-31          1.0          NaN          NaN          NaN  DatetimeIndex(['2022-01-01', '2022-01-02', '20...
    1          INDEX_1 2022-01-01 2022-12-31          NaN          3.0          NaN          NaN  DatetimeIndex(['2022-01-01', '2022-01-02', '20...
    2          INDEX_1 2022-02-01 2022-12-31          2.0          NaN          NaN          NaN  DatetimeIndex(['2022-02-01', '2022-02-02', '20...
    3          INDEX_2 2022-01-01 2022-12-31          NaN          NaN          4.0          NaN  DatetimeIndex(['2022-01-01', '2022-01-02', '20...
    4          INDEX_3 2022-01-01 2022-12-31          NaN          NaN          NaN          5.0  DatetimeIndex(['2022-01-01', '2022-01-02', '20...
    
    # Parentheses are required for a method chain split over several lines
    TS = (df_pivot.explode('timespan', ignore_index=True)
                  .drop(columns=['start_date', 'end_date']))
    
    attribute    INDEX  attribute_1  attribute_2  attribute_3  attribute_4   timespan
    0          INDEX_1          1.0          NaN          NaN          NaN 2022-01-01
    1          INDEX_1          1.0          NaN          NaN          NaN 2022-01-02
    2          INDEX_1          1.0          NaN          NaN          NaN 2022-01-03
    3          INDEX_1          1.0          NaN          NaN          NaN 2022-01-04
    4          INDEX_1          1.0          NaN          NaN          NaN 2022-01-05
    ...            ...          ...          ...          ...          ...        ...
    1455       INDEX_3          NaN          NaN          NaN          5.0 2022-12-27
    1456       INDEX_3          NaN          NaN          NaN          5.0 2022-12-28
    1457       INDEX_3          NaN          NaN          NaN          5.0 2022-12-29
    1458       INDEX_3          NaN          NaN          NaN          5.0 2022-12-30
    1459       INDEX_3          NaN          NaN          NaN          5.0 2022-12-31
    

    At this point, a given day still appears several times within each INDEX. You can see it by plotting TS.plot(y='timespan'), or by checking that TS.duplicated(subset=['INDEX', 'timespan']).any() returns True.

    3. Aggregate the time series into the requested form
    # 3a. One row per INDEX and per day
    TS = (TS.groupby(by=['INDEX','timespan'])
            .max()
            .reset_index())
    
    attribute    INDEX   timespan  attribute_1  attribute_2  attribute_3  attribute_4
    0          INDEX_1 2022-01-01          1.0          3.0          NaN          NaN
    1          INDEX_1 2022-01-02          1.0          3.0          NaN          NaN
    2          INDEX_1 2022-01-03          1.0          3.0          NaN          NaN
    3          INDEX_1 2022-01-04          1.0          3.0          NaN          NaN
    4          INDEX_1 2022-01-05          1.0          3.0          NaN          NaN
    ...            ...        ...          ...          ...          ...          ...
    1090       INDEX_3 2022-12-27          NaN          NaN          NaN          5.0
    1091       INDEX_3 2022-12-28          NaN          NaN          NaN          5.0
    1092       INDEX_3 2022-12-29          NaN          NaN          NaN          5.0
    1093       INDEX_3 2022-12-30          NaN          NaN          NaN          5.0
    1094       INDEX_3 2022-12-31          NaN          NaN          NaN          5.0
    

    Now each day appears only once per INDEX.

    Choice of agg function: We took care that there is only ever one possible value per day, per attribute and per INDEX, so almost any agg function would do. However, aggregating with max() (or min()) preserves the NaNs, unlike sum(), which by default (min_count=0) returns 0 for a group containing only NaNs.
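
To illustrate that difference, here is a small toy example (the frame `toy` and its column names are made up for the illustration, they are not part of your data). Group "b" contains only NaNs:

```python
import numpy as np
import pandas as pd

# Toy data: group "a" has one real value, group "b" has only NaNs.
toy = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "x": [1.0, np.nan, np.nan, np.nan],
})

by_max = toy.groupby("key")["x"].max()                    # "b" stays NaN
by_sum = toy.groupby("key")["x"].sum()                    # "b" becomes 0.0
by_sum_strict = toy.groupby("key")["x"].sum(min_count=1)  # "b" stays NaN

print(pd.DataFrame({"max": by_max, "sum": by_sum, "sum(min_count=1)": by_sum_strict}))
```

So sum() would also work, provided min_count=1 is passed.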

    # 3b. Rebuild start and end dates; string names avoid passing the builtins min/max
    df_final = (TS.groupby(by=['INDEX','attribute_1','attribute_2','attribute_3','attribute_4'], dropna=False)
                  .agg({'timespan' : ['min', 'max']})
                  .reset_index())
    
    attribute    INDEX attribute_1 attribute_2 attribute_3 attribute_4   timespan           
                                                                              min        max
    0          INDEX_1         1.0         3.0         NaN         NaN 2022-01-01 2022-01-31
    1          INDEX_1         2.0         3.0         NaN         NaN 2022-02-01 2022-12-31
    2          INDEX_2         NaN         NaN         4.0         NaN 2022-01-01 2022-12-31
    3          INDEX_3         NaN         NaN         NaN         5.0 2022-01-01 2022-12-31
    

    Note the groupby option dropna=False, as in "pandas GroupBy columns with NaN (missing) values": it keeps the rows whose grouping keys contain NaNs, i.e. "NaN == NaN" being false does not ruin the grouping.
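
A minimal illustration of what dropna changes, on a made-up frame (`toy` and its values are hypothetical): with the default dropna=True, the row whose key is NaN silently disappears from the result.

```python
import numpy as np
import pandas as pd

# Toy data: INDEX_2 has no value for attribute_1.
toy = pd.DataFrame({
    "INDEX": ["INDEX_1", "INDEX_1", "INDEX_2"],
    "attribute_1": [1.0, 1.0, np.nan],
    "day": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-01"]),
})

# Default behaviour: groups with a NaN key are dropped.
dropped = toy.groupby(["INDEX", "attribute_1"])["day"].min()

# dropna=False: NaN is kept as a valid group key.
kept = toy.groupby(["INDEX", "attribute_1"], dropna=False)["day"].min()

print(len(dropped), len(kept))
```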

    Basically, we now have your requested table. Finally, we could discuss a few finishing touches, such as

    • flattening the multi-indices,
    • or compacting some code lines into one-liners where applicable.
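
For the first point, here are two possible ways to end up with flat column names, shown on a small made-up frame (`daily` below is a stand-in for TS; the names `flat` and `named` are mine). Either join the two levels after the fact, or avoid the MultiIndex altogether with named aggregation:

```python
import pandas as pd

# Small stand-in for the daily time series.
daily = pd.DataFrame({
    "INDEX": ["INDEX_1"] * 3,
    "attribute_1": [1.0, 1.0, 2.0],
    "timespan": pd.to_datetime(["2022-01-01", "2022-01-31", "2022-02-01"]),
})

# Same pattern as df_final: the dict-of-lists agg yields two-level columns.
nested = (daily.groupby(["INDEX", "attribute_1"])
               .agg({"timespan": ["min", "max"]})
               .reset_index())

# Option 1: join the levels, skipping the empty second level of the key columns.
flat = nested.copy()
flat.columns = ["_".join(filter(None, col)) for col in nested.columns]

# Option 2: named aggregation produces flat, explicitly named columns directly.
named = (daily.groupby(["INDEX", "attribute_1"], as_index=False)
              .agg(start_date=("timespan", "min"), end_date=("timespan", "max")))

print(list(flat.columns))
print(list(named.columns))
```

Option 2 has the advantage of giving back start_date and end_date under their original names.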

    Potential bug? Step 3b: if an identical combination of attribute values were to occur over several separate time spans, the min/max approach would merge them into a single span and fail to reconstitute the proper start_date and end_date. I believe this is something to take care of.
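
One possible way around this, sketched on a made-up daily series (this is not part of the solution above; the names `run_id`, `naive` and `runs` are mine): instead of grouping on the attribute values themselves, number the runs of consecutive days sharing the same values, and group on that run number.

```python
import pandas as pd

# Toy daily series: attribute_1 is 1, then 2, then 1 again.
daily = pd.DataFrame({
    "INDEX": ["INDEX_1"] * 8,
    "day": pd.date_range("2022-01-01", periods=8),
    "attribute_1": [1.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 1.0],
})
attrs = ["attribute_1"]

# The min/max approach merges both spans of value 1 into a single one.
naive = daily.groupby(["INDEX"] + attrs, dropna=False)["day"].agg(["min", "max"])

# A row continues the previous run if INDEX and attributes are unchanged
# (NaN counts as equal to NaN here) and the day is the next calendar day.
daily = daily.sort_values(["INDEX", "day"], ignore_index=True)
prev = daily.shift()
same_attrs = ((daily[attrs] == prev[attrs])
              | (daily[attrs].isna() & prev[attrs].isna())).all(axis=1)
contiguous = ((daily["INDEX"] == prev["INDEX"])
              & ((daily["day"] - prev["day"]) == pd.Timedelta(days=1)))
run_id = (~(same_attrs & contiguous)).cumsum()

# One output row per run.
runs = (daily.groupby(run_id)
             .agg(INDEX=("INDEX", "first"),
                  start_date=("day", "min"),
                  end_date=("day", "max"),
                  **{a: (a, "first") for a in attrs})
             .reset_index(drop=True))

print(naive)
print(runs)
```

On this toy input, the naive grouping reports a single span from 2022-01-01 to 2022-01-08 for value 1, whereas the run-based grouping returns three spans.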