
Python: Grouping values of different columns into time buckets


Say you have this DataFrame:

Name    Item    Date    value1  value2
Marc    bike    21-Dec-17   7   1000
Marc    bike    05-Jan-18   9   2000
Marc    bike    27-Jul-18   4   500
John    house   14-Dec-17   4   500
John    house   02-Feb-18   6   500
John    house   07-Feb-18   8   1000
John    house   16-Feb-18   2   1000
John    house   05-Dec-21   7   1000
John    house   27-Aug-25   8   500
John    car     17-Apr-18   4   500

I would like to bin value1 and value2 into monthly buckets (one bucket per third Wednesday over the next 48 months) for each name-item combination.

So for each combination (Marc/bike, John/house, John/car, …) there are 49 time buckets, each holding the sum of value1 and value2 for that month.

The solution for John/house would look like:

Name    Item    TimeBucket  value1  value2
John    house   20-Dec-17   4   500
John    house   17-Jan-18   0   0
John    house   21-Feb-18   16  2500
John    house   21-Mar-18   0   0
John    house   18-Apr-18   0   0
John    house   …           0   0
John    house   17-Nov-21   0   0
John    house   15-Dec-21   7   1000
John    house   rest        8   500

I can't get to a result with pandas. The only solution I can think of is a row-by-row iteration through the DataFrame, but I'd really like to avoid that. Is there an elegant way to do it?


Solution

  • The question really boils down to three steps:

    1. How to find the third Wednesday of every month?

    This may not be the most elegant solution, but you can filter out the third Wednesday of every month by masking a pandas DatetimeIndex which contains every day in the time frame.

    # generate a DatetimeIndex containing every day in the relevant time frame
    import itertools
    from datetime import datetime

    import pandas as pd

    start = datetime(2017, 12, 1)
    end = datetime(2022, 1, 31)
    days = pd.date_range(start, end, freq='D')

    # keep only the third Wednesday (weekday == 2) of each month
    third_wednesdays = []
    for year, month in itertools.product(range(2017, 2023), range(1, 13)):
        mask = (days.weekday == 2) & \
            (days.year == year) & \
            (days.month == month)
        if len(days[mask]) >= 3:  # skip months outside the time frame
            third_wednesdays.append(days[mask][2])
    bucket_lower_bounds = pd.DatetimeIndex(third_wednesdays)
    

    Convert the resulting list to a DatetimeIndex so you can use it as the lower bounds of the bins in step 2.
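    As a shorter alternative sketch: pandas ships "week of month" anchored offset aliases, so the third Wednesday of every month can be generated directly with the frequency string `'WOM-3WED'`, without any day-level masking:

    ```python
    import pandas as pd

    # 'WOM-3WED' = "week of month, 3rd Wednesday"; date_range then yields
    # exactly one timestamp per month in the requested window.
    bucket_lower_bounds = pd.date_range("2017-12-01", "2022-01-31",
                                        freq="WOM-3WED")
    # first entries: 2017-12-20, 2018-01-17, 2018-02-21, ...
    ```

    The result is already a DatetimeIndex, so it can be used as the bucket lower bounds in step 2 as-is.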

    2. How to bin values of a DataFrame?

    Then, once you have the buckets as a DatetimeIndex, you can simply use pandas' cut function to assign each date to a bucket. Make sure the Date column has been parsed to datetime dtype first; then convert the dates to integers before passing them into cut, and convert the result back to dates:

    time_buckets = pd.to_datetime(
        pd.cut(
            x = pd.to_numeric(df['Date']), 
            bins = pd.to_numeric(bucket_lower_bounds), 
            labels = bucket_lower_bounds[:-1]
        )
    )
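    Note that the sample data stores dates as strings like "21-Dec-17", which the integer conversion cannot handle, so parse them first. A small sketch (the format string is an assumption matching the sample data; recent pandas versions also accept datetime bins in cut directly, which skips the integer round-trip):

    ```python
    import pandas as pd

    # parse the string dates into a real datetime column first
    df = pd.DataFrame({"Date": ["21-Dec-17", "05-Jan-18", "14-Dec-17"]})
    df["Date"] = pd.to_datetime(df["Date"], format="%d-%b-%y")

    # recent pandas lets cut bin datetimes directly against datetime edges
    bucket_lower_bounds = pd.DatetimeIndex(
        ["2017-12-20", "2018-01-17", "2018-02-21"])
    time_buckets = pd.cut(df["Date"], bins=bucket_lower_bounds,
                          labels=bucket_lower_bounds[:-1])
    # 21-Dec-17 and 05-Jan-18 land in the 2017-12-20 bucket;
    # 14-Dec-17 lies before the first edge and becomes NaN (the "rest" case)
    ```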
    

    The series time_buckets assigns each index value of your original data frame to a bucket's lower bound. We can simply add it to the original data frame now:

    df['TimeBucket'] = time_buckets
    

    The result should look somewhat like this (note that NaT stands for the "rest" bucket):

        Name    Item    Date    value1  value2  TimeBucket
    0   Marc    bike    2017-12-21  7   1000    2017-12-20
    1   Marc    bike    2018-01-05  9   2000    2017-12-20
    2   Marc    bike    2018-07-27  4   500     2018-07-18
    3   John    house   2017-12-14  4   500     NaT
    4   John    house   2018-02-02  6   500     2018-01-17
    5   John    house   2018-02-07  8   1000    2018-01-17
    6   John    house   2018-02-16  2   1000    2018-01-17
    7   John    house   2021-12-05  7   1000    2021-11-17
    8   John    house   2025-08-27  8   500     NaT
    9   John    car     2018-04-17  4   500     2018-03-21
    

    3. How to aggregate a binned DataFrame?

    Now it's as simple as using groupby to get the sums for each combination of name, item and bucket (select the value columns explicitly so the Date column stays out of the sum):

    df.groupby(['Name','Item','TimeBucket'])[['value1','value2']].sum()
    

    Result:

    Name    Item    TimeBucket  value1  value2
    John    car     2018-03-21  4       500
            house   2018-01-17  16      2500
                    2021-11-17  7       1000
    Marc    bike    2017-12-20  16      3000
                    2018-07-18  4       500
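    The desired output also lists empty months as explicit zero rows. One way to get those is to reindex the grouped result against the full set of buckets for every name-item combination. A sketch, using a small stand-in for the binned frame from step 2 and the `WOM-3WED` bucket index:

    ```python
    import pandas as pd

    # stand-in for the binned frame produced by step 2
    df = pd.DataFrame({
        "Name": ["John", "John"],
        "Item": ["house", "house"],
        "TimeBucket": pd.to_datetime(["2018-01-17", "2021-11-17"]),
        "value1": [16, 7],
        "value2": [2500, 1000],
    })
    buckets = pd.date_range("2017-12-01", "2022-01-31", freq="WOM-3WED")

    summed = df.groupby(["Name", "Item", "TimeBucket"])[
        ["value1", "value2"]].sum()

    # build every (Name, Item, bucket) combination that should exist,
    # then reindex so months without data become zero rows
    combos = df[["Name", "Item"]].drop_duplicates()
    full_index = pd.MultiIndex.from_tuples(
        [(name, item, b)
         for name, item in combos.itertuples(index=False)
         for b in buckets],
        names=["Name", "Item", "TimeBucket"],
    )
    result = summed.reindex(full_index, fill_value=0)
    ```

    This yields one row per bucket and combination, with zeros where no input rows fell into the month.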
    

    Unfortunately, NaT values are excluded from groupby by default. If you need to sum those as well, perhaps it would be easiest to make sure that your list of buckets has at least one bucket for every date in your input range.
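    Since pandas 1.1 there is also a direct option (assuming a recent pandas version): groupby accepts dropna=False, which keeps the NaT rows together as their own "rest" group instead of silently dropping them. A sketch with stand-in data:

    ```python
    import pandas as pd

    # the two NaT rows represent dates that fell outside the bucket range
    df = pd.DataFrame({
        "Name": ["John", "John", "John"],
        "Item": ["house", "house", "house"],
        "TimeBucket": pd.to_datetime(["2018-01-17", None, None]),
        "value1": [16, 8, 7],
        "value2": [2500, 500, 1000],
    })

    # dropna=False keeps NaT as a regular group key
    result = df.groupby(["Name", "Item", "TimeBucket"], dropna=False)[
        ["value1", "value2"]].sum()
    ```

    The NaT group then carries the summed "rest" values (here 15 and 1500).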

    Edit: Step 2 requires pandas version >= 0.18.1.