python pandas multi-index date-range

How to "expand" multi-index with date_range in pandas?


I have an input dataframe generated like this using pandas (v1.3.2):

import pandas as pd
import datetime

input_data = [
    ["1", datetime.datetime(2023,2,21,20,0,0), 10],
    ["1", datetime.datetime(2023,2,21,20,30,0), 10],
    ["2", datetime.datetime(2023,2,21,15,0,0), 15],
    ["2", datetime.datetime(2023,2,21,15,30,0), 15],
]

df_input = pd.DataFrame(data=input_data, columns=["id", "time", "duration"]).set_index(["id", "time"])

I want to "expand" the second level of the index (index column "time") according to the slot duration (column "duration"), so that the output dataframe has one row per slot between the first and last time of each id.

For example, for the first id ("1"), I want all 10-minute slots from 20:00 to 20:30: 20:00, 20:10, 20:20, 20:30.

I came up with a solution (see the code snippet below), but it is slow. Is there anything faster built into pandas for this?

import pandas as pd
import datetime

input_data = [
    ["1", datetime.datetime(2023,2,21,20,0,0), 10],
    ["1", datetime.datetime(2023,2,21,20,30,0), 10],
    ["2", datetime.datetime(2023,2,21,15,0,0), 15],
    ["2", datetime.datetime(2023,2,21,15,30,0), 15],
]

df_input = pd.DataFrame(data=input_data, columns=["id", "time", "duration"]).set_index(["id", "time"])

df_output = pd.DataFrame()

# Rows come in (start, end) pairs per id, so step through them two at a time.
for i in range(0, df_input.shape[0], 2):
    start_at = df_input.index[i][1]
    end_at = df_input.index[i+1][1]
    duration = df_input.iloc[i]["duration"]

    # Build every slot between start and end, spaced by the duration in minutes.
    df_cut = pd.DataFrame(
        pd.date_range(
            start=start_at,
            end=end_at,
            freq=f"{duration}min",
        ).rename("start_at_converted")
    )
    df_cut["id"] = df_input.index[i][0]
    df_cut["duration"] = duration

    # Append this id's slots to the accumulated result.
    df_output = pd.concat((df_output, df_cut), axis=0)

df_output = df_output.set_index(["id", "start_at_converted"])

Thank you for your help!


Solution

  • You can use groupby.resample:

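    # resampling frequency for each id, matching its "duration" value
    freq = {'1': '10min', '2': '15min'}
    
    out = (df_input.reset_index('id').groupby('id')
           # resample each group at its own frequency, forward-filling the new rows
           .apply(lambda g: g.resample(freq[g.name]).ffill())
           .drop(columns='id')
           # optional, to rename the index
           .rename_axis(('id', 'start_at_converted'))
          )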
    

    Output:

                            duration
    id start_at_converted           
    1  2023-02-21 20:00:00        10
       2023-02-21 20:10:00        10
       2023-02-21 20:20:00        10
       2023-02-21 20:30:00        10
    2  2023-02-21 15:00:00        15
       2023-02-21 15:15:00        15
       2023-02-21 15:30:00        15
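
The `freq` dictionary above is written by hand. A minimal sketch, assuming each id has a single `duration` value expressed in minutes (as in the question's data), derives the frequency from the `duration` column instead. The helper name `expand_slots` is hypothetical, and looping over the groups is just one way to do this; it is not claimed to be faster than the `groupby.apply` version.

```python
import datetime

import pandas as pd

input_data = [
    ["1", datetime.datetime(2023, 2, 21, 20, 0, 0), 10],
    ["1", datetime.datetime(2023, 2, 21, 20, 30, 0), 10],
    ["2", datetime.datetime(2023, 2, 21, 15, 0, 0), 15],
    ["2", datetime.datetime(2023, 2, 21, 15, 30, 0), 15],
]
df_input = pd.DataFrame(
    data=input_data, columns=["id", "time", "duration"]
).set_index(["id", "time"])


def expand_slots(df):
    """Resample each id at the frequency given by its own duration (hypothetical helper)."""
    pieces = {}
    for key, group in df.groupby(level="id"):
        # Keep only the datetime level so the group can be resampled.
        group = group.droplevel("id")
        # Assumption: duration is constant within an id and is in minutes.
        step = f"{int(group['duration'].iloc[0])}min"
        pieces[key] = group.resample(step).ffill()
    # The dict keys become the outer "id" level of the resulting index.
    out = pd.concat(pieces, names=["id"])
    return out.rename_axis(["id", "start_at_converted"])


out = expand_slots(df_input)
print(out)
```

With the sample data this should give the same seven rows as the output shown above, without hard-coding a frequency per id.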
    

    If you have only one frequency for all ids:

    out = (df_input.reset_index('id').groupby('id')
           .resample('15min').ffill().drop(columns='id')
           # optional, to rename the index
           #.rename_axis(('id', 'start_at_converted'))
          )
    

    Output:

                            duration
    id time                         
    1  2023-02-21 20:00:00        10
       2023-02-21 20:15:00        10
       2023-02-21 20:30:00        10
    2  2023-02-21 15:00:00        15
       2023-02-21 15:15:00        15
       2023-02-21 15:30:00        15
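
The question builds the slots with `pd.date_range`. A sketch that keeps that idea but avoids growing a dataframe inside a loop might look like the following. It assumes that each id's earliest and latest `time` bound its slots and that `duration` is constant per id and given in minutes; the `bounds` name is illustrative. Whether it is faster than `resample` on real data has not been measured here.

```python
import datetime

import pandas as pd

input_data = [
    ["1", datetime.datetime(2023, 2, 21, 20, 0, 0), 10],
    ["1", datetime.datetime(2023, 2, 21, 20, 30, 0), 10],
    ["2", datetime.datetime(2023, 2, 21, 15, 0, 0), 15],
    ["2", datetime.datetime(2023, 2, 21, 15, 30, 0), 15],
]
df = pd.DataFrame(data=input_data, columns=["id", "time", "duration"])

# One row per id: first slot, last slot, and slot length in minutes.
bounds = df.groupby("id").agg(
    start=("time", "min"),
    end=("time", "max"),
    duration=("duration", "first"),
)

# Build the list of slots for each id with date_range.
bounds["start_at_converted"] = [
    list(pd.date_range(start=s, end=e, freq=f"{d}min"))
    for s, e, d in zip(bounds["start"], bounds["end"], bounds["duration"])
]

# Explode the lists to one row per slot, then restore a datetime dtype.
out = bounds[["start_at_converted", "duration"]].explode("start_at_converted")
out["start_at_converted"] = pd.to_datetime(out["start_at_converted"])
out = out.set_index("start_at_converted", append=True)
print(out)
```

Unlike the loop in the question, this does not rely on the rows arriving in (start, end) pairs, only on the minimum and maximum time per id.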