I have the following dataframe:
ID Name Study City Grade Group_ID
1 Vir BE Ahm 1 1
2 Vish ME Rjk 2 2
3 Keval B.sc Beng 4 3
4 Vir M.sc Ahm 1 4
5 Priya B.com Una 2 5
6 Vish ME Rjk 2 2
7 Keval B.sc Pune 4 3
8 Vish ME Ahm 2 NaN
9 Keval B.sc Pune 4 NaN
10 Yesha M.sc Beng 2 NaN
11 Vasu M.Tech Hyd 0 6
12 Kria B.Tech Mub 6 7
I want to increase the number of group_ID columns by name, city and grade column. If the value of the name, city and grade column is present in the dataframe, then the value of Group_ID is to be taken. if not present then increase the Group_ID value by a specific number or max(Group_ID + 1).
What is the best way to get the result of the dataframe below?
ID Name Study City Grade Group_ID
1 Vir BE Ahm 1 1
2 Vish ME Rjk 2 2
3 Keval B.sc Beng 4 3
4 Vir M.sc Ahm 1 4
5 Priya B.com Una 2 5
6 Vish ME Rjk 2 2
7 Keval B.sc Pune 4 3
8 Vish ME Ahm 2 2
9 Keval B.sc Pune 4 3
10 Yesha M.sc Beng 2 8
11 Vasu M.Tech Hyd 0 6
12 Kria B.Tech Mub 6 7
I have tried by df['Group_ID'] = df.groupby(['Name', 'City', 'Grade'], sort=False).ngroup() + 8 but in won't worked for present values of column. I believe there will be an easier way to do this.
Use:
#replace missing values by existing groups
s = df.groupby(["Name", "City", "Grade"])["Group_ID"].transform('first')
df['Group_ID'] = df['Group_ID'].fillna(s)
#mask for missing values of not existing groups
mask = s.isna()
#only for these groups use ngroup with add maximal Group_ID
df.loc[mask, 'Group_ID'] = (df[mask].groupby(['Name', 'City', 'Grade'], sort=False).ngroup()
+ df["Group_ID"].max() + 1)
#convert to integers
df['Group_ID'] = df['Group_ID'].astype(int)
print (df)
ID Name Study City Grade Group_ID
0 1 Vir BE Ahm 1 1
1 2 Vish ME Rjk 2 2
2 3 Keval B.sc Beng 4 3
3 4 Vir M.sc Ahm 1 4
4 5 Priya B.com Una 2 5
5 6 Vish ME Rjk 2 2
6 7 Keval B.sc Pune 4 3
7 8 Vish ME Ahm 2 8
8 9 Keval B.sc Pune 4 3
9 10 Yesha M.sc Beng 2 9
10 11 Vasu M.Tech Hyd 0 6
11 12 Kria B.Tech Mub 6 7
Output is different, because Vish,Rjk,2
not match Vish,Ahm,2
, so new Group_ID
was created.