Search code examples
pythonpandasdataframegroup-by

Update dataframe column based on groupby count value


Hit a roadblock and hence asking the larger community.

I have an input dataframe that has employee id and a dept id.

First I need to get the group by count per DeptID .

Then I want to aggregate all entries where count is less than 10 and replace the deptID with first 3 letter and suffixed by _merged and add them to the count of this merged DeptID .

I am able to get the first half done but hitting a roadblock with the second half.

Later on I may also need to apply a hard limit on the merged dept id count i.e. not to exceed say 100.

Input dataframe:

EmpID     DeptID 

8959236   YTCD
8597697   YTCZ
9312115   YTCD
2931213   YTCQ
9312197   QCVZ
4912748   YKKC
5474785   QCIO
1575364   MLQD
1575878   PCJZ 
2915764   QCIO

Get aggregate count by dept id:

agg_count_by_DeptID  = df.groupby("DeptID")["DeptID"].count()

agg_count_by_DeptID pandas series:

YTCD   20
YTCZ    2    
YTCQ    4    
YTCL    1    
QCVZ   25 
YKKC   11
QCIO    6
MLQD    9
PCJZ    7
QCIK    2

Output Logic

Iterate through the panda series,

  • First get all elements that share the first 3 dept id characters like YTCD, YTCZ, YTCQ and YTCL
  • Sort them in increasing order ( not necessary but good to have)
  • Then check the count and,
    • if count is greater than 10 then leave it.
    • if count is less than 10 then replace by FIRST-3-CHARS_merged
  • Finally update input data frame with new deptID

expected output pandas series:

YTCD   20
YTC_merged 7    
QCVZ   25 
YKKC   11
QCI_merged  8    
MLQ_merged    9
PCJ_merged    7

Note : how counts of YTCZ, YTCQ and YTCL got summed to YTC_merged as all were less than 10 , where as YTCD was left as it is

Output dataframe:

EmpID     DeptID 

8959236   YTCD
8597697   YTC_merged
9312115   YTCD
2931213   YTC_merged
9312197   QCVZ
4912748   YKKC
5474785   QCI_merged
1575364   MLQ_merged
1575878   PCJ_merged 
2915764   QCI_merged

Solution

  • Code

    The example needs to be complete in itself. Since there are no items in your MRE with a count greater than 10, I'll set the threshold to 1.

    Here's the code that would leave the count as is if it's greater than 1, and extract only the first three characters and add '_merged' if it's equal to or less than 1 in your example.

    thresh = 1 # you can change thresh to 10
    cond = df.groupby('DeptID')['EmpID'].transform('count') > thresh
    df['DeptID'] = df['DeptID'].where(cond, df['DeptID'].str[:3] + '_merged')
    

    df

         EmpID      DeptID
    0  8959236        YTCD
    1  8597697  YTC_merged
    2  9312115        YTCD
    3  2931213  YTC_merged
    4  9312197  QCV_merged
    5  4912748  YKK_merged
    6  5474785        QCIO
    7  1575364  MLQ_merged
    8  1575878  PCJ_merged
    9  2915764        QCIO
    

    Example Code

    import pandas as pd
    data = {'EmpID': [8959236, 8597697, 9312115, 2931213, 9312197, 4912748, 5474785, 1575364, 1575878, 2915764], 'DeptID': ['YTCD', 'YTCZ', 'YTCD', 'YTCQ', 'QCVZ', 'YKKC', 'QCIO', 'MLQD', 'PCJZ', 'QCIO']}
    df = pd.DataFrame(data)