Search code examples
pythonpython-3.xpandasdataframegroup-by

Pandas DataFrame grouping by custom dates (quarters)


I am working with DataFrames with values of website hits and minutes per day, I need to group (and sum) them by quarter but with two caveats:

-Quarters start date is non standard (i.e Q1 is 27th Feb to 26th May...) -I have a wide range of dates for different websites so I want to be able to specify a day and month (irrespective of year) and have the df grouped accordingly.

Please find below a way to replicate the type of DataFrame I am working with

import random
import pandas as pd

date_range = pd.date_range(start='2018-1-1', end='2022-10-03')
daily_views = [random.randint(1000,9999) for i in range(len(date_range))]
daily_minutes = [random.randint(1000,9999) for i in range(len(date_range))]

df = pd.DataFrame(
    {'DailyViews': daily_views, 'DailyMinutes': daily_minutes},
    index=date_range,
)

So far have tried df_grouped = df.groupby(df.index.shift(freq='Q')).sum() but could not get the offset to work with this as it seems to want to offset by a number of quarters and I am looking for finer control.

Have also tried df_resampled = df.resample('Q', convention='end', offset=datetime.timedelta(days=25)).sum() but changing the offset does not seem to affect the output.

Currently trying to manually compute the quarters given the input and manually checking each row (with a .apply()) before doing pd.todatetime() and grouping. But this feels very inefficient and long winded and there must be a simpler/more elegant way to get to the answer?

Any help would be greatly appreciated!

Many Thanks

EDIT, TEMPORARY SOLUTION:

I have implemented a temporary solution to allow me to change the starting date of quarters by increments of one month:

quarter_start_month = 2
month_to_quarter_mapping = {
    i + 1: (12 + ((i - (quarter_start_month - 1) % 3) // 3)) % 4 + 1 for i in range(12)
}

df["QMap"] = df.index.month.map(month_to_quarter_mapping)
df["YMap"] = np.where(
    (df.QMap == 4) & (df.index.month.to_series(index=df.index) < 5),
    df.index.year.map(lambda x: x - 1),
    df.index.year)
df["Quarter"] = df.YMap.astype(str) + "Q" + df.QMap.astype(str)

df.drop(columns=["QMap", "YMap"], inplace=True)
df = df.groupby("Quarter").sum()

Basically a very manual way of extracting what quarter a date is in as a string and then grouping via that column.

Caveat is that I am stuck indexing with a string (i.e "2020Q1") as if I try to convert back pd.to_datetime(df.index) it will interpret the quarters as standard quarters and put the start date of the normal quarter even when it is not actually in the modified quarter

BONUS QUESTION:

If anyone knows a specific command to have the dataframe display "2022Q1" as opposed to "2022-01-01" that would be very helpful


Solution

  • I finally came up with a solution, my peace of mind is safe !

    I am not really sure I understood your second caveat, so I went as if you wanted to group by quarter independently of the year for your whole dataset, but the hardest part was probably to group based on custom quarters anyway (as a lots of pandas' methods round up to the start/end of the month/quarter and the offset, as you said, does not seem to work as we would like to).

    # Define those custom quarters as a range of intervals, right bin being excluded:
    # I went for an easy solution to guarantee full overlapping by checking data min
    # and max year. This is suboptimal as some intervals are possibly useless)
    starting_year = df.index.min().year - 1
    nb_periods = (df.index.max().year - starting_year) * 4 + 1
    quarters_intervals = pd.interval_range(start=pd.Timestamp(f'{starting_year}-11-27'),
                                           freq=pd.offsets.DateOffset(months=3),
                                           periods=nb_periods, closed='left')
    
    # Sort the dates into the custom intervals:
    df['QuarterInterval'] = pd.cut(df.index.to_series(), bins=quarters_intervals)
    
    # I don't know why, but it is not possible to use the labels to pd.cut() when
    # providing bins as IntervalIndex, so let's map to the quarter name now:
    quarters_labels = [f'Q{(i + 3) % 4 + 1}' for i in range(len(quarters_intervals))]
    mapper = {I:Q for I, Q in zip(quarters_intervals, quarters_labels)}
    df['Quarter'] = df['QuarterInterval'].map(mapper)
    
    # Quick check:
    print(df.sample(5))
    

    Output:

                DailyViews  DailyMinutes           QuarterInterval Quarter
    2020-12-09        3054          7496  [2020-11-27, 2021-02-27)      Q4
    2022-04-30        9396          5273  [2022-02-27, 2022-05-27)      Q1
    2022-02-07        2076          7088  [2021-11-27, 2022-02-27)      Q4
    2019-10-25        9506          5835  [2019-08-27, 2019-11-27)      Q3
    2018-09-16        2001          6334  [2018-08-27, 2018-11-27)      Q3
    

    and finally...

    # Let's get our quarterly grouped statistics:
    print(df.groupby('Quarter')[['DailyViews', 'DailyMinutes']].sum())
    

    Output:

             DailyViews  DailyMinutes
    Quarter                          
    Q1          2413595       2420882
    Q2          2541627       2533811
    Q3          2226040       2212354
    Q4          2268091       2320683