I am working with DataFrames with values of website hits and minutes per day, I need to group (and sum) them by quarter but with two caveats:
-Quarters start date is non standard (i.e Q1 is 27th Feb to 26th May...) -I have a wide range of dates for different websites so I want to be able to specify a day and month (irrespective of year) and have the df grouped accordingly.
Please find below a way to replicate the type of DataFrame I am working with
import random
import pandas as pd
date_range = pd.date_range(start='2018-1-1', end='2022-10-03')
daily_views = [random.randint(1000,9999) for i in range(len(date_range))]
daily_minutes = [random.randint(1000,9999) for i in range(len(date_range))]
df = pd.DataFrame(
{'DailyViews': daily_views, 'DailyMinutes': daily_minutes},
index=date_range,
)
So far have tried df_grouped = df.groupby(df.index.shift(freq='Q')).sum()
but could not get the offset to work with this as it seems to want to offset by a number of quarters and I am looking for finer control.
Have also tried df_resampled = df.resample('Q', convention='end', offset=datetime.timedelta(days=25)).sum()
but changing the offset does not seem to affect the output.
Currently trying to manually compute the quarters given the input and manually checking each row (with a .apply()
) before doing pd.todatetime()
and grouping. But this feels very inefficient and long winded and there must be a simpler/more elegant way to get to the answer?
Any help would be greatly appreciated!
Many Thanks
EDIT, TEMPORARY SOLUTION:
I have implemented a temporary solution to allow me to change the starting date of quarters by increments of one month:
quarter_start_month = 2
month_to_quarter_mapping = {
i + 1: (12 + ((i - (quarter_start_month - 1) % 3) // 3)) % 4 + 1 for i in range(12)
}
df["QMap"] = df.index.month.map(month_to_quarter_mapping)
df["YMap"] = np.where(
(df.QMap == 4) & (df.index.month.to_series(index=df.index) < 5),
df.index.year.map(lambda x: x - 1),
df.index.year)
df["Quarter"] = df.YMap.astype(str) + "Q" + df.QMap.astype(str)
df.drop(columns=["QMap", "YMap"], inplace=True)
df = df.groupby("Quarter").sum()
Basically a very manual way of extracting what quarter a date is in as a string and then grouping via that column.
Caveat is that I am stuck indexing with a string (i.e "2020Q1"
) as if I try to convert back pd.to_datetime(df.index)
it will interpret the quarters as standard quarters and put the start date of the normal quarter even when it is not actually in the modified quarter
BONUS QUESTION:
If anyone knows a specific command to have the dataframe display "2022Q1" as opposed to "2022-01-01" that would be very helpful
I finally came up with a solution, my peace of mind is safe !
I am not really sure I understood your second caveat, so I went as if you wanted to group by quarter independently of the year for your whole dataset, but the hardest part was probably to group based on custom quarters anyway (as a lots of pandas' methods round up to the start/end of the month/quarter and the offset, as you said, does not seem to work as we would like to).
# Define those custom quarters as a range of intervals, right bin being excluded:
# I went for an easy solution to guarantee full overlapping by checking data min
# and max year. This is suboptimal as some intervals are possibly useless)
starting_year = df.index.min().year - 1
nb_periods = (df.index.max().year - starting_year) * 4 + 1
quarters_intervals = pd.interval_range(start=pd.Timestamp(f'{starting_year}-11-27'),
freq=pd.offsets.DateOffset(months=3),
periods=nb_periods, closed='left')
# Sort the dates into the custom intervals:
df['QuarterInterval'] = pd.cut(df.index.to_series(), bins=quarters_intervals)
# I don't know why, but it is not possible to use the labels to pd.cut() when
# providing bins as IntervalIndex, so let's map to the quarter name now:
quarters_labels = [f'Q{(i + 3) % 4 + 1}' for i in range(len(quarters_intervals))]
mapper = {I:Q for I, Q in zip(quarters_intervals, quarters_labels)}
df['Quarter'] = df['QuarterInterval'].map(mapper)
# Quick check:
print(df.sample(5))
Output:
DailyViews DailyMinutes QuarterInterval Quarter
2020-12-09 3054 7496 [2020-11-27, 2021-02-27) Q4
2022-04-30 9396 5273 [2022-02-27, 2022-05-27) Q1
2022-02-07 2076 7088 [2021-11-27, 2022-02-27) Q4
2019-10-25 9506 5835 [2019-08-27, 2019-11-27) Q3
2018-09-16 2001 6334 [2018-08-27, 2018-11-27) Q3
and finally...
# Let's get our quarterly grouped statistics:
print(df.groupby('Quarter')[['DailyViews', 'DailyMinutes']].sum())
Output:
DailyViews DailyMinutes
Quarter
Q1 2413595 2420882
Q2 2541627 2533811
Q3 2226040 2212354
Q4 2268091 2320683