Hit a roadblock and hence asking the larger community.
I have an input dataframe that has employee id and a dept id.
First I need to get the group by count per DeptID .
Then I want to aggregate all entries where count is less than 10 and replace the deptID with first 3 letter and suffixed by _merged
and add them to the count of this merged DeptID .
I am able to get the first half done but hitting a roadblock with the second half.
Later on I may also need to apply a hard limit on the merged dept id count i.e. not to exceed say 100.
Input dataframe:
EmpID DeptID
8959236 YTCD
8597697 YTCZ
9312115 YTCD
2931213 YTCQ
9312197 QCVZ
4912748 YKKC
5474785 QCIO
1575364 MLQD
1575878 PCJZ
2915764 QCIO
Get aggregate count by dept id:
agg_count_by_DeptID = df.groupby("DeptID")["DeptID"].count()
agg_count_by_DeptID pandas series:
YTCD 20
YTCZ 2
YTCQ 4
YTCL 1
QCVZ 25
YKKC 11
QCIO 6
MLQD 9
PCJZ 7
QCIK 2
Output Logic
Iterate through the panda series,
FIRST-3-CHARS_merged
expected output pandas series:
YTCD 20
YTC_merged 7
QCVZ 25
YKKC 11
QCI_merged 8
MLQ_merged 9
PCJ_merged 7
Note : how counts of YTCZ, YTCQ and YTCL got summed to YTC_merged
as all were less than 10 , where as YTCD
was left as it is
Output dataframe:
EmpID DeptID
8959236 YTCD
8597697 YTC_merged
9312115 YTCD
2931213 YTC_merged
9312197 QCVZ
4912748 YKKC
5474785 QCI_merged
1575364 MLQ_merged
1575878 PCJ_merged
2915764 QCI_merged
Code
The example needs to be complete in itself. Since there are no items in your MRE with a count greater than 10, I'll set the threshold to 1.
Here's the code that would leave the count as is if it's greater than 1, and extract only the first three characters and add '_merged' if it's equal to or less than 1 in your example.
thresh = 1 # you can change thresh to 10
cond = df.groupby('DeptID')['EmpID'].transform('count') > thresh
df['DeptID'] = df['DeptID'].where(cond, df['DeptID'].str[:3] + '_merged')
df
EmpID DeptID
0 8959236 YTCD
1 8597697 YTC_merged
2 9312115 YTCD
3 2931213 YTC_merged
4 9312197 QCV_merged
5 4912748 YKK_merged
6 5474785 QCIO
7 1575364 MLQ_merged
8 1575878 PCJ_merged
9 2915764 QCIO
Example Code
import pandas as pd
data = {'EmpID': [8959236, 8597697, 9312115, 2931213, 9312197, 4912748, 5474785, 1575364, 1575878, 2915764], 'DeptID': ['YTCD', 'YTCZ', 'YTCD', 'YTCQ', 'QCVZ', 'YKKC', 'QCIO', 'MLQD', 'PCJZ', 'QCIO']}
df = pd.DataFrame(data)