Code to generate random database for question (minimum reproducible issue):
df_random = pd.DataFrame(np.random.random((2000,3)))
df_random['order_date'] = pd.date_range(start='1/1/2015',
periods=len(df_random), freq='D')
df_random['customer_id'] = np.random.randint(1, 20, df_random.shape[0])
df_random
Output df_random
0 1 2 order_date customer_id
0 0.018473 0.970257 0.605428 2015-01-01 12
... ... ... ... ... ...
1999 0.800139 0.746605 0.551530 2020-06-22 11
Code to extract mean unique transactions month and year wise
for y in (2015,2019):
for x in (1,13):
df2 = df_random[(df_random['order_date'].dt.month == x)&(df_random['order_date'].dt.year== y)]
df2.sort_values(['customer_id','order_date'],inplace=True)
df2["days"] = df2.groupby("customer_id")["order_date"].apply(lambda x: (x - x.shift()) / np.timedelta64(1, "D"))
df_mean=round(df2['days'].mean(),2)
data2 = data.append(pd.DataFrame({'Mean': df_mean , 'Month': x, 'Year': y}, index=[0]), ignore_index=True)
print(data2)
Expected output
Mean Month Year
0 5.00 1 2015
.......................
11 6.62 12 2015
..............Mean values of days after which one transaction occurs in order_date for years 2016 and 2017 Jan to Dec
36 6.03 1 2018
..........................
47 6.76 12 2018
48 8.40 1 2019
.......................
48 8.40 12 2019
Basically I want single dataframe starting from 2015 Jan month to 2019 December
Instead of the expected output I am getting dataframe from Jan 2015 to Dec 2018 , then again Jan 2015 data and then the entire dataset repeats again from 2015 to 2018 many more times.
Please help
Try this:
data2 = pd.DataFrame([])
for y in range(2015,2020):
for x in range(1,13):
df2 = df_random[(df_random['order_date'].dt.month == x)&(df_random['order_date'].dt.year== y)]
df_mean=df2.groupby("customer_id")["order_date"].apply(lambda x: (x - x.shift()) / np.timedelta64(1, "D")).mean().round(2)
data2 = data2.append(pd.DataFrame({'Mean': df_mean , 'Month': x, 'Year': y}, index=[0]), ignore_index=True)
print(data2)
Try this :
df_random.order_date = pd.to_datetime(df_random.order_date)
df_random = df_random.set_index(pd.DatetimeIndex(df_random['order_date']))
output = df_random.groupby(pd.Grouper(freq="M"))[[0,1,2]].agg(np.mean).reset_index()
output['month'] = output.order_date.dt.month
output['year'] = output.order_date.dt.year
output = output.drop('order_date', axis=1)
output
Output
0 1 2 month year
0 0.494818 0.476514 0.496059 1 2015
1 0.451611 0.437638 0.536607 2 2015
2 0.476262 0.567519 0.528129 3 2015
3 0.519229 0.475887 0.612433 4 2015
4 0.464781 0.430593 0.445455 5 2015
... ... ... ... ... ...
61 0.416540 0.564928 0.444234 2 2020
62 0.553787 0.423576 0.422580 3 2020
63 0.524872 0.470346 0.560194 4 2020
64 0.530440 0.469957 0.566077 5 2020
65 0.584474 0.487195 0.557567 6 2020