Search code examples
pythonpandaspysparkwindow-functionsmoving-average

Set NaN when mean divisors aren't enough


I have this dataset, when there isn't enough data in the 3 or 6 months window, the service registers the entry as NaN.

Dataset:

df = pd.DataFrame({'customer':['C1000','C1000','C1000','C1000','C1000','C1000','C1000','C1000','C2000','C2000','C2000','C2000'],
                    'account': ['A1100','A1100','A1100','A1200','A1200','A1300','A1300','A1300','A2100','A2100','A2100','A2100'],
                    'month':   ['2019-10-01','2019-11-01','2019-12-01','2019-10-01','2019-11-01','2019-10-01','2019-11-01','2019-12-01','2019-09-01','2019-10-01','2019-11-01','2019-12-01'],
                    'invoice': [34000,55000,80000,90000,55000,10000,10000,20000,45000,78000,55000,80000]
                  })

The expected result is like this: Note the NaN value when we don't have 3 full months of data.

+--------+-------+--------------------------+
|customer|account|avg_invoices_last_3_months|
+--------+-------+--------------------------+
|C1000   |A1100  |41,333                    |
|C1000   |A1200  |NaN                       |
...

I tried this transformation:

# Count how many rows has in each month
df['cnt_month'] = df.groupby(['customer','account']).transform('count')
# At this point, both columns receive NaN value, but the invoice column can't 
# be change
df.loc[df.cnt_month < 4] = 'NaN'
# Here, I need to group by customer and account for invoice values 
# ​​also NaN and numeric in the grouping like in the "The expected result"
df.groupby(['customer','account','month'])['invoice'].mean()

But, the result isn't working.


Solution

  • I think you could substitute 'NaN', and then tally it with 'invoice'. Does it meet the intent of the question?

    df.loc[df.cnt_month < 4, 'invoice'] = np.nan
    df.groupby(['customer','account'])['invoice'].agg(avg_invoices_last_3_months=('invoice','mean')).reset_index()
    
        customer    account avg_invoices_last_3_months
    0   C1000   A1100   NaN
    1   C1000   A1200   NaN
    2   C1000   A1300   NaN
    3   C2000   A2100   64500.0