Search code examples
pythonpandasfor-loopnested-forms

Nested for loop python pandas not functioning as desired


Code to generate random database for question (minimum reproducible issue):

df_random = pd.DataFrame(np.random.random((2000,3)))
df_random['order_date'] = pd.date_range(start='1/1/2015', 
periods=len(df_random), freq='D')
df_random['customer_id'] = np.random.randint(1, 20, df_random.shape[0])
df_random

Output df_random

        0          1               2    order_date  customer_id
0   0.018473    0.970257    0.605428    2015-01-01    12
    ... ... ... ... ... ...
    1999    0.800139    0.746605    0.551530    2020-06-22  11

Code to extract mean unique transactions month and year wise

for y in (2015,2019):
for x in (1,13):
    df2 = df_random[(df_random['order_date'].dt.month == x)&(df_random['order_date'].dt.year== y)]
    df2.sort_values(['customer_id','order_date'],inplace=True)
    df2["days"] = df2.groupby("customer_id")["order_date"].apply(lambda x: (x - x.shift()) / np.timedelta64(1, "D"))
    df_mean=round(df2['days'].mean(),2)
    data2 = data.append(pd.DataFrame({'Mean': df_mean , 'Month': x, 'Year': y}, index=[0]), ignore_index=True)
    print(data2)

Expected output

  Mean         Month  Year
0   5.00          1  2015
    .......................
11  6.62         12  2015

..............Mean values of days after which one transaction occurs in order_date for years 2016 and 2017 Jan to Dec

36  6.03          1  2018
..........................
47  6.76         12  2018
48  8.40          1  2019
.......................
48  8.40         12  2019

Basically I want single dataframe starting from 2015 Jan month to 2019 December

Instead of the expected output I am getting dataframe from Jan 2015 to Dec 2018 , then again Jan 2015 data and then the entire dataset repeats again from 2015 to 2018 many more times.

Please help


Solution

  • Try this:

    data2 = pd.DataFrame([])
    for y in range(2015,2020):
        for x in range(1,13):
            df2 = df_random[(df_random['order_date'].dt.month == x)&(df_random['order_date'].dt.year== y)]
            df_mean=df2.groupby("customer_id")["order_date"].apply(lambda x: (x - x.shift()) / np.timedelta64(1, "D")).mean().round(2)
            data2 = data2.append(pd.DataFrame({'Mean': df_mean , 'Month': x, 'Year': y}, index=[0]), ignore_index=True)
    
    print(data2)
    

    Try this :

    df_random.order_date = pd.to_datetime(df_random.order_date)
    df_random = df_random.set_index(pd.DatetimeIndex(df_random['order_date']))
    output = df_random.groupby(pd.Grouper(freq="M"))[[0,1,2]].agg(np.mean).reset_index()
    output['month'] = output.order_date.dt.month
    output['year'] = output.order_date.dt.year
    output = output.drop('order_date', axis=1)
    output
    

    Output

    0   1   2   month   year
    0   0.494818    0.476514    0.496059    1   2015
    1   0.451611    0.437638    0.536607    2   2015
    2   0.476262    0.567519    0.528129    3   2015
    3   0.519229    0.475887    0.612433    4   2015
    4   0.464781    0.430593    0.445455    5   2015
    ... ... ... ... ... ...
    61  0.416540    0.564928    0.444234    2   2020
    62  0.553787    0.423576    0.422580    3   2020
    63  0.524872    0.470346    0.560194    4   2020
    64  0.530440    0.469957    0.566077    5   2020
    65  0.584474    0.487195    0.557567    6   2020