Search code examples
python-3.xpandasdate-range

How can I turn date intervals given in columns and extract the volumes per hour in the time interval for muiltiple overlapping periods?


I have the following df:

Start_Date   End_Date  Relevant  Volume   
2024-10-01 2024-12-31     False  0.000000   
2025-01-01 2025-03-31      True  0.097989   
2025-04-01 2025-06-30      True -0.014449   
2025-01-01 2025-12-31      True  0.195327   
2026-01-01 2026-12-31     False  0.000000   

I need an hourly index starting/ending at the first/last dates, where Relevant == True. I do this as follows:

    relevant_df = df[df['Relevant']]    
    earliest_start = relevant_df['Start_Date'].min()
    latest_end = relevant_df['End_Date'].max()
    # Create DateTime index
    date_range = pd.date_range(start=earliest_start, end=latest_end, freq='H')
    aggregated_volumes = pd.Series(index=date_range, dtype=float)

Now, how can I take the Volume per periode and add them together so that, in this ex, the first three months of 2025, the volume per hour equals 0.097989 + 0.195327 second quarter -0.014449 + 0.195327 etc..


Solution

  • Since your intervals are overlapping, I believe there is no direct way to index your values.

    You can however create a NxM numpy array (N: number of rows, M: number of True rows) and sum:

    # ensure datetime
    relevant_df[['Start_Date', 'End_Date']] = relevant_df[['Start_Date', 'End_Date']].apply(pd.to_datetime)
    
    # compute a mask of the values between start/end
    idx = date_range.to_numpy()[:, None]
    m1 = idx>=relevant_df['Start_Date'].to_numpy()
    m2 = idx<relevant_df['End_Date'].to_numpy()
    
    # broadcast the values, sum, convert to Series
    out = pd.Series(np.nansum(np.where(m1&m2, relevant_df['Volume'].to_numpy(), np.nan), axis=1),
                    index=date_range)
    

    Output:

    2025-01-01 00:00:00    0.293316
    2025-01-01 01:00:00    0.293316
    2025-01-01 02:00:00    0.293316
    2025-01-01 03:00:00    0.293316
    2025-01-01 04:00:00    0.293316
                             ...   
    2025-12-30 20:00:00    0.195327
    2025-12-30 21:00:00    0.195327
    2025-12-30 22:00:00    0.195327
    2025-12-30 23:00:00    0.195327
    2025-12-31 00:00:00    0.000000
    Freq: h, Length: 8737, dtype: float64
    

    Alternatively, if you have sparse intervals relative to the full range:

    out = pd.DataFrame([pd.Series(v, index=pd.date_range(s, e, inclusive='left', freq='h'))
                        for s,e,v in zip(relevant_df['Start_Date'],
                                         relevant_df['End_Date'],
                                         relevant_df['Volume']
                                        )]).sum()
    

    Output:

    2025-01-01 00:00:00    0.293316
    2025-01-01 01:00:00    0.293316
    2025-01-01 02:00:00    0.293316
    2025-01-01 03:00:00    0.293316
    2025-01-01 04:00:00    0.293316
                             ...   
    2025-12-30 19:00:00    0.195327
    2025-12-30 20:00:00    0.195327
    2025-12-30 21:00:00    0.195327
    2025-12-30 22:00:00    0.195327
    2025-12-30 23:00:00    0.195327
    Freq: h, Length: 8736, dtype: float64
    

    Visual output:

    enter image description here