Search code examples
pandasdataframenumpypandas-groupby

Pandas - Groupby by three columns with cumsum or cumcount or nggroup


I have the following dataframe:

ID    Name    Study   City   Grade   Group_ID
1     Vir     BE      Ahm     1        1
2     Vish    ME      Rjk     2        2
3     Keval   B.sc    Beng    4        3
4     Vir     M.sc    Ahm     1        4
5     Priya   B.com   Una     2        5
6     Vish    ME      Rjk     2        2
7     Keval   B.sc    Pune    4        3
8     Vish    ME      Ahm     2        NaN
9     Keval   B.sc    Pune    4        NaN
10    Yesha   M.sc    Beng    2        NaN  
11    Vasu    M.Tech  Hyd     0        6
12    Kria    B.Tech  Mub     6        7

I want to increase the number of group_ID columns by name, city and grade column. If the value of the name, city and grade column is present in the dataframe, then the value of Group_ID is to be taken. if not present then increase the Group_ID value by a specific number or max(Group_ID + 1).

What is the best way to get the result of the dataframe below?

ID    Name    Study   City   Grade   Group_ID
1     Vir     BE      Ahm     1        1
2     Vish    ME      Rjk     2        2
3     Keval   B.sc    Beng    4        3
4     Vir     M.sc    Ahm     1        4
5     Priya   B.com   Una     2        5
6     Vish    ME      Rjk     2        2
7     Keval   B.sc    Pune    4        3
8     Vish    ME      Ahm     2        2
9     Keval   B.sc    Pune    4        3
10    Yesha   M.sc    Beng    2        8  
11    Vasu    M.Tech  Hyd     0        6
12    Kria    B.Tech  Mub     6        7

I have tried by df['Group_ID'] = df.groupby(['Name', 'City', 'Grade'], sort=False).ngroup() + 8 but in won't worked for present values of column. I believe there will be an easier way to do this.


Solution

  • Use:

    #replace missing values by existing groups
    s = df.groupby(["Name", "City", "Grade"])["Group_ID"].transform('first')
    df['Group_ID'] = df['Group_ID'].fillna(s)
    #mask for missing values of not existing groups
    mask = s.isna()
    #only for these groups use ngroup with add maximal Group_ID
    df.loc[mask, 'Group_ID'] = (df[mask].groupby(['Name', 'City', 'Grade'], sort=False).ngroup()
                                 + df["Group_ID"].max() + 1)
    #convert to integers
    df['Group_ID'] = df['Group_ID'].astype(int)
    print (df)
        ID   Name   Study  City  Grade  Group_ID
    0    1    Vir      BE   Ahm      1         1
    1    2   Vish      ME   Rjk      2         2
    2    3  Keval    B.sc  Beng      4         3
    3    4    Vir    M.sc   Ahm      1         4
    4    5  Priya   B.com   Una      2         5
    5    6   Vish      ME   Rjk      2         2
    6    7  Keval    B.sc  Pune      4         3
    7    8   Vish      ME   Ahm      2         8
    8    9  Keval    B.sc  Pune      4         3
    9   10  Yesha    M.sc  Beng      2         9
    10  11   Vasu  M.Tech   Hyd      0         6
    11  12   Kria  B.Tech   Mub      6         7
    

    Output is different, because Vish,Rjk,2 not match Vish,Ahm,2, so new Group_ID was created.