
Pandas: Normalizing all the values in one column between 0 and 10 based on groups present in another column


Suppose I have a dataframe like:

    Group  Values
0       1       1
1       1       4
2       1       2
3       1       7
4       1       3
5       2       4
6       2       1
7       2       5
8       2      12
9       2       4
10      2      10
11      3       2
12      3       6
13      3      20
14      3      15

MRE:

df = pd.DataFrame({'Group': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3], 'Values': [1, 4, 2, 7, 3, 4, 1, 5, 12, 4, 10, 2, 6, 20, 15]})

Notice that group 1 has a maximum value of 7, while group 2 has 12 and group 3 has 20. I want to normalize each group's Values so that the group maximum is capped at 10.

I tried the groupby method, but I have no clue how to proceed from there. I know I could use a for loop, but that would be very inefficient since the data I'm working with has about 20k samples.

Is there a sweet and simple method that can do the trick?


Solution

  • You can use groupby with transform for this. There's a similar example in the docs for transform():

    import pandas as pd
    
    df = pd.DataFrame({'Group': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3],
                       'Values': [1, 4, 2, 7, 3, 4, 1, 5, 12, 4, 10, 2, 6, 20, 15]})
    
    # For each group, divide the values by that group's maximum and scale by 10,
    # so the largest value in every group maps to exactly 10.
    df['normal'] = df.groupby('Group')['Values'].transform(lambda x: x / x.max() * 10)
    
    print(df)
    

    Prints:

        Group  Values     normal
    0       1       1   1.428571
    1       1       4   5.714286
    2       1       2   2.857143
    3       1       7  10.000000
    4       1       3   4.285714
    5       2       4   3.333333
    6       2       1   0.833333
    7       2       5   4.166667
    8       2      12  10.000000
    9       2       4   3.333333
    10      2      10   8.333333
    11      3       2   1.000000
    12      3       6   3.000000
    13      3      20  10.000000
    14      3      15   7.500000
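
  • If the lambda ever becomes a bottleneck (you mention ~20k rows), the same idea can be written in a fully vectorized form by computing each row's group maximum with the built-in 'max' aggregation and dividing directly; a minimal sketch using the same df as above:

        # Same result as the lambda version: scale each value by its group's maximum.
        df['normal'] = df['Values'] / df.groupby('Group')['Values'].transform('max') * 10

    This produces the same normal column as in the output above.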