python, pandas, pandas-groupby

Create a time series that sums data on each day D, if D is between the start date and the end date


My raw data is a DataFrame with three columns that describe journeys: quantity, start date, and end date. My goal is to create a new DataFrame with a daily index and a single column showing the sum of the quantities of the journeys that were "on the way" each day, i.e. sum quantity if day > start date and day < end date. I think I can achieve this by creating a daily index and then looping over it, using a mask on each day to filter the data and then summing. I haven't managed to make it work, and I suspect there is a better approach. Below is my attempt with some dummy data:

data = [[10, '2020-03-02', '2020-03-27'],
        [18, '2020-03-06', '2020-03-10'],
        [21, '2020-03-20', '2020-05-02'],
        [33, '2020-01-02', '2020-03-01']]
columns = ['quantity', 'startdate', 'enddate']
index = [1,2,3,4]
df = pd.DataFrame(data,index,columns)

index2 = pd.date_range(start='2020-01-01', end='2020-06-01', freq='D')
df2 = pd.DataFrame(0,index2,'quantities')
for t in index2:
    mask = (df['start']<t) & (df['end']>t)
    df2['quantities'] = df[mask]['quantity'].sum()

Solution

  • You could create a date range for each record, then explode it and group by date:

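    import pandas as pd

    data = [[10, '2020-03-02', '2020-03-27'],
            [18, '2020-03-06', '2020-03-10'],
            [21, '2020-03-20', '2020-05-02'],
            [33, '2020-01-02', '2020-03-01']]
    columns = ['quantity', 'startdate', 'enddate']
    index = [1, 2, 3, 4]
    df = pd.DataFrame(data, index, columns)

    # Build the daily date range covered by each journey.
    # Note: pd.date_range includes both endpoints by default, so the start
    # and end days are counted here (the question uses strict inequalities).
    df['range'] = df.apply(lambda x: pd.date_range(x['startdate'], x['enddate'], freq='D'), axis=1)
    # One row per (journey, day), then sum the quantities for each day.
    df = df.explode('range')
    df.groupby('range')['quantity'].sum()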
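
For comparison, here is a sketch of how the loop from the question might be repaired, next to a variant of the explode/groupby idea that follows the strict `day > start date and day < end date` rule. The names `days`, `loop_result`, `one_day` and `explode_result` are my own and not from the original post. The assumed problems in the loop are the column names (`start`/`end` instead of `startdate`/`enddate`), comparing date strings with timestamps, passing `'quantities'` as a bare string for the columns, and assigning to the whole column on every iteration instead of to the row for day `t`.

```python
import pandas as pd

data = [[10, '2020-03-02', '2020-03-27'],
        [18, '2020-03-06', '2020-03-10'],
        [21, '2020-03-20', '2020-05-02'],
        [33, '2020-01-02', '2020-03-01']]
df = pd.DataFrame(data, index=[1, 2, 3, 4],
                  columns=['quantity', 'startdate', 'enddate'])

# Parse the date strings so they can be compared with Timestamps.
df['startdate'] = pd.to_datetime(df['startdate'])
df['enddate'] = pd.to_datetime(df['enddate'])

days = pd.date_range(start='2020-01-01', end='2020-06-01', freq='D')

# 1) The loop from the question, repaired: write one value per day.
loop_result = pd.Series(0, index=days, name='quantities')
for t in days:
    mask = (df['startdate'] < t) & (df['enddate'] > t)
    loop_result.loc[t] = df.loc[mask, 'quantity'].sum()

# 2) explode/groupby with strict bounds: shrink each range by one day
#    at both ends, because pd.date_range includes its endpoints.
one_day = pd.Timedelta(days=1)
df['range'] = df.apply(
    lambda row: pd.date_range(row['startdate'] + one_day,
                              row['enddate'] - one_day, freq='D'),
    axis=1)
exploded = df.explode('range')
exploded['range'] = pd.to_datetime(exploded['range'])
explode_result = (exploded.groupby('range')['quantity'].sum()
                  .reindex(days, fill_value=0))  # days with no journey get 0
```

Both results are indexed by the same daily range, so they can be compared directly. The loop is easier to read, while the explode version avoids the Python-level loop over days at the cost of creating one row per journey per day.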