I'm trying to determine reporting period from a dataframe of reports. Options for reporting period would be:
The dataframe looks like this:
data = [['F000003Y6H', '2018-07-31'], ['F000003Y6H', '2018-08-31'],
['F000003Y6H', '2018-09-30'], ['F000003Y6H', '2018-10-31'],
['FOUSA06C1Y', '2021-07-31'], ['FOUSA06C1Y', '2021-10-31'],
['FOUSA06C1Y', '2022-1-31'], ['FOUSA06C1Y', '2022-04-30'],
['FOUSA06C1Y', '2022-7-31'], ['FO0005TYH7', '2019-5-31'],
['FO0005TYH7', '2019-6-30'], ['FO0005TYH7', '2019-7-31'],
['FO0005TYH7', '2019-10-31'], ['FO0005TYH7', '2020-1-31']]
df = pd.DataFrame(data, columns=['class_id', 'report_date'])
To determine frequency of reports(for each class_id), I have this:
df['report_date'] = pd.to_datetime(df['report_date'])
df['datediff_days'] = df.groupby(['class_id'])['report_date'].diff()
class_ids = df['class_id'].unique()
def determine_period(class_id):
new_df = df.groupby('class_id')['datediff_days'].describe()
mean_td = new_df.at[class_id, 'mean']
mean_days = mean_td.days
if 28<mean_days<32:
#monthly
#period is number of months
period = 1
elif 85<mean_days<95:
#quarterly
period = 3
else:
period = 0
return period
for x in class_ids:
period = determine_period(x)
print("For class: " + str(x))
print("Period is " + str(period))
Is there a better/simpler way to do this?
You can try to use pandas.DatetimeIndex.inferred_freq
:
df["report_date"] = pd.to_datetime(df["report_date"])
df["inferred_freq"] = df.groupby("class_id")["report_date"].transform(
lambda x: pd.DatetimeIndex(x).inferred_freq
)
print(df)
Prints:
class_id report_date inferred_freq
0 F000003Y6H 2018-07-31 M
1 F000003Y6H 2018-08-31 M
2 F000003Y6H 2018-09-30 M
3 F000003Y6H 2018-10-31 M
4 FOUSA06C1Y 2021-07-31 Q-OCT
5 FOUSA06C1Y 2021-10-31 Q-OCT
6 FOUSA06C1Y 2022-01-31 Q-OCT
7 FOUSA06C1Y 2022-04-30 Q-OCT
8 FOUSA06C1Y 2022-07-31 Q-OCT
9 FO0005TYH7 2019-05-31 NaN
10 FO0005TYH7 2019-06-30 NaN
11 FO0005TYH7 2019-07-31 NaN
12 FO0005TYH7 2019-10-31 NaN
13 FO0005TYH7 2020-01-31 NaN