Search code examples
pandasnumpystatisticsmode

Generating column mode in data panel


I have an unbalanced data panel where each period a student can have a certain level/type of scholarship:

head(df)

ID     student_period         scholarship
   
4567        1              scholarship_level_1
4567        2              scholarship_level_2
4567        3              scholarship_level_2
4567        4              scholarship_level_3
5478        4              scholarship_level_3
5478        5              scholarship_level_3
6758        7              scholarship_level_1
6758        8              scholarship_level_2
6758        9              scholarship_level_2

Basically, I want to create a new variable that plots the statistical mode of the scholarship level for each student ID in this panel. Something like this:

head(df1)

ID     student_period         scholarship            scholarship_mode
   
4567        1              scholarship_level_1      scholarship_level_2
4567        2              scholarship_level_2      scholarship_level_2 
4567        3              scholarship_level_2      scholarship_level_2
4567        4              scholarship_level_3      scholarship_level_2
5478        4              scholarship_level_3      scholarship_level_3
5478        5              scholarship_level_3      scholarship_level_3
6758        7              scholarship_level_1      scholarship_level_2
6758        8              scholarship_level_2      scholarship_level_2
6758        9              scholarship_level_2      scholarship_level_2


Any ideas?


Solution

  • You can use groupby+transform and value_counts:

    df['scholarship_mode'] = (df.groupby('ID')['scholarship']
                              .transform(lambda x: x.value_counts().index[0]))
    

    or mode:

    df['scholarship_mode'] = (df.groupby('ID')['scholarship']
                              .transform(lambda x: x.mode().iloc[0]))
    

    output:

         ID  student_period          scholarship     scholarship_mode
    0  4567               1  scholarship_level_1  scholarship_level_2
    1  4567               2  scholarship_level_2  scholarship_level_2
    2  4567               3  scholarship_level_2  scholarship_level_2
    3  4567               4  scholarship_level_3  scholarship_level_2
    4  5478               4  scholarship_level_3  scholarship_level_3
    5  5478               5  scholarship_level_3  scholarship_level_3
    6  6758               7  scholarship_level_1  scholarship_level_2
    7  6758               8  scholarship_level_2  scholarship_level_2
    8  6758               9  scholarship_level_2  scholarship_level_2
    

    NB. be aware that mode/value_counts can have ties, in this case only one value will be used.