Tags: python, pandas, dataframe, time-series, data-analysis

Calculate the unique employee count over the last 3 months for each month using pandas


I want to calculate the number of unique employee IDs over the last 3 months using pandas. I can calculate the unique employee ID count for the current month, but I am not sure how to do it for the last 3 months.

Excel file having Date, Date Month group and Employee ID

df['DateM'] = df['Date'].dt.to_period('M')
df.groupby("DateM")["EmpId"].nunique().reset_index().rename(columns={"EmpId":"One Month Unique EMP count"}).sort_values("DateM",ascending=False).reset_index(drop=True)

testdata.xlsx (Google Drive link): https://docs.google.com/spreadsheets/d/1Kaguf72YKIsY7rjYfctHop_OLIgOvIaS/edit?usp=sharing&ouid=117123134308310688832&rtpof=true&sd=true

After running the groupby command above, I get the output for one-month groups based on the DateM column, which is correct.

Output for 1 month group unique count

Similarly, I am looking for another column that holds the 3-month unique active user count based on EmpId.

Sample output:

Final output with last 3 months unique count from current month

I tried calculating this with a rolling window, but it did not help. I also tried creating a period for the last 3 months, and I searched before asking this question. Thanks in advance for your help; otherwise I will have to calculate it manually.


Solution

  • I don't know whether you are looking for 3 consecutive months or something else, because your dates are discontinuous around 2022-09 to 2022-10.

    I also don't know your purpose, so I give a general solution here. If you only want the count of unique IDs for every 3 consecutive months, it is much easier. The solution here gives you the list of unique empid values for every 3 consecutive months. Note that the window runs forward from each month: for 2022-08, the 3 consecutive months are 2022-08, 2022-09, and 2022-10, and so on.

    import pandas as pd
    
    # This code assumes lowercase column names (`date`, `datem`, `empid`),
    # unlike the `Date`, `DateM`, `EmpId` used in the question.
    # Sort data:
    df.sort_values(by='datem', inplace=True, ignore_index=True)
    
    # Create `dfu`, which holds one row per (`datem`, `empid`) pair:
    dfu = df.groupby(['datem', 'empid']).count().reset_index()
    dfu.rename(columns={'date':'count'}, inplace=True)
    dfu.sort_values(by=['datem', 'empid'], inplace=True, ignore_index=True)
    dfu
    
    # Obtain the list of unique periods:
    unique_period = dfu['datem'].unique()
    
    # Collect one dataframe per starting period, then concatenate once at the end:
    frames = []
    
    for p in unique_period:
        # Create the range of 3 consecutive months starting at `p`:
        tem_range = pd.period_range(start=p, freq='M', periods=3)
        
        # Extract the rows of `dfu` whose period falls in that range:
        tem_dfu = dfu.loc[dfu['datem'].isin(tem_range), :].copy()
        
        # Keep each `empid` once per window; the result has to be assigned back,
        # because `drop_duplicates` does not modify the dataframe in place by default:
        tem_dfu = tem_dfu.drop_duplicates(subset='empid', keep='first')
        tem_dfu = tem_dfu.drop(columns='count')
        tem_dfu['start_period'] = p
        
        frames.append(tem_dfu)
    
    # Concat and obtain desired output:
    dfe = pd.concat(frames, ignore_index=True)
    dfe
    

    I hope this is what you are looking for.
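
If what you need is a trailing window instead (the current month plus the two calendar months before it, as the question describes), a minimal sketch could look like the following. The toy data, the helper name `trailing_unique_count`, and the output column name `Three Month Unique EMP count` are assumptions made for illustration; the input column names follow the question (`Date`, `DateM`, `EmpId`). The window is calendar-based, so a month with no rows (such as a gap in the data) simply contributes no IDs.

```python
import pandas as pd

# Hypothetical toy data standing in for the Excel file.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2022-06-05", "2022-06-20", "2022-07-03",
        "2022-08-11", "2022-08-15", "2022-10-02",
    ]),
    "EmpId": [1, 2, 2, 3, 1, 4],
})
df["DateM"] = df["Date"].dt.to_period("M")

# Map each month that has data to the set of employee IDs seen in it.
ids_by_month = {month: set(ids) for month, ids in df.groupby("DateM")["EmpId"]}


def trailing_unique_count(month, n_months=3):
    """Count distinct EmpId values in `month` and the n_months - 1 months before it."""
    window = pd.period_range(end=month, periods=n_months, freq="M")
    seen = set()
    for m in window:
        # Months without any rows contribute an empty set.
        seen |= ids_by_month.get(m, set())
    return len(seen)


months = list(ids_by_month)
result = (
    pd.DataFrame({
        "DateM": months,
        "One Month Unique EMP count": [len(ids_by_month[m]) for m in months],
        "Three Month Unique EMP count": [trailing_unique_count(m) for m in months],
    })
    .sort_values("DateM", ascending=False)
    .reset_index(drop=True)
)
print(result)
```

A plain `rolling(3)` on the monthly counts would add the per-month numbers together and double-count employees who appear in more than one month, which is why this sketch takes the union of the ID sets before counting.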