Search code examples
python-3.xpandasdataframegroup-by

Apply sklearn.preprocessing.MinMaxScaler function based on specific group/id in Python


I have a dataframe given as such:

#Load the required libraries
import pandas as pd


#Create dataset
data = {'id': ['A', 'A', 'A', 'A', 'A','A', 'A',
               'B', 'B', 'B', 'B', 'B', 'B',
               'C', 'C', 'C', 'C', 'C',
               'D', 'D', 'D', 'D',
               'E', 'E', 'E', 'E', 'E','E'],
        'cycle': [1,2, 3, 4, 5,6,7,
                   1,2, 3,4,5,6,
                   1,2, 3, 4, 5,
                  1,2, 3, 4,
                  1,2, 3, 4, 5,6,],
        'Salary': [7, 7, 7,8,9,10,15,
                   4, 4, 4,4,5,6,
                   8,9,10,12,13,
                   8,9,10,11,
                   7, 11,12,13,14,15,],
        'Jobs': [123, 18, 69, 65, 120, 11, 52,
                 96, 120,10, 141, 52,6,
                 101,99, 128, 1, 141,
                 141,123, 12, 66,
                 12, 128, 66, 100, 141, 52,],
        'Days': [123, 128, 66, 66, 120, 141, 52,
                 96, 120,120, 141, 52,96,
                 15,123, 128, 120, 141,
                 141,123, 128, 66,
                 123, 128, 66, 120, 141, 52,],
        }

#Convert to dataframe
df = pd.DataFrame(data)
print("df = \n", df)

The above dataframe looks as such:

enter image description here

Here, I wish to apply sklearn.preprocessing.MinMaxScaler on the columns 'Salary', 'Jobs', 'Days' corresponding to that specific group/id as such:

enter image description here

Can somebody please let me know how to achieve this task in Python?


Solution

  • You can use groupby to compute minmax scale per group:

    cols = ['Salary', 'Jobs', 'Days']
    minmax_scale = lambda x: (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
    df[cols] = df.groupby('id')[cols].apply(minmax_scale)
    

    Output:

    >>> df
       id  cycle    Salary      Jobs      Days
    0   A      1  0.000000  1.000000  0.797753
    1   A      2  0.000000  0.062500  0.853933
    2   A      3  0.000000  0.517857  0.157303
    3   A      4  0.125000  0.482143  0.157303
    4   A      5  0.250000  0.973214  0.764045
    5   A      6  0.375000  0.000000  1.000000  # Max for Days of Group A
    6   A      7  1.000000  0.366071  0.000000  # Min for Days of Group A
    7   B      1  0.000000  0.666667  0.494382
    8   B      2  0.000000  0.844444  0.764045
    9   B      3  0.000000  0.029630  0.764045
    10  B      4  0.000000  1.000000  1.000000
    11  B      5  0.500000  0.340741  0.000000
    12  B      6  1.000000  0.000000  0.494382
    13  C      1  0.000000  0.714286  0.000000
    14  C      2  0.200000  0.700000  0.857143
    15  C      3  0.400000  0.907143  0.896825
    16  C      4  0.800000  0.000000  0.833333
    17  C      5  1.000000  1.000000  1.000000
    18  D      1  0.000000  1.000000  1.000000
    19  D      2  0.333333  0.860465  0.760000
    20  D      3  0.666667  0.000000  0.826667
    21  D      4  1.000000  0.418605  0.000000
    22  E      1  0.000000  0.000000  0.797753
    23  E      2  0.500000  0.899225  0.853933
    24  E      3  0.625000  0.418605  0.157303
    25  E      4  0.750000  0.682171  0.764045
    26  E      5  0.875000  1.000000  1.000000
    27  E      6  1.000000  0.310078  0.000000
    

    As suggested by @mozway, you can use a function or the walrus operator in lambda function:

    # The fastest
    def minmax_scale(x):
        xmin = x.min(axis=0)
        xmax = x.max(axis=0)
        return (x - xmin) / (xmax - xmin)
    
    # Average performance, using walrus operator (Python >= 3.8)
    minmax_scale = lambda x: (x - (m := x.min(axis=0))) / (x.max(axis=0) - m)
    
    # The slowest
    minmax_scale = lambda x: (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))