python · scikit-learn · data-science · normalizing

Weird output of MinMaxScaler


While working my way through learning ML, I am confused by the MinMaxScaler provided by sklearn. The goal is to normalize numerical data into the range [0, 1].

Example code:

from sklearn.preprocessing import MinMaxScaler

data = [[1, 2], [3, 4], [4, 5]]
scaler = MinMaxScaler(feature_range=(0, 1))
scaledData = scaler.fit_transform(data)

Giving output:

[[0.         0.        ]
 [0.66666667 0.66666667]
 [1.         1.        ]]

The first array [1, 2] got transformed into [0, 0] which in my eyes means:

  • The ratio between the numbers is gone
  • Neither value carries any information anymore, as both were set to the minimum value (0).

Example of what I have expected:

[[0.1, 0.2]
 [0.3, 0.4]
 [0.4, 0.5]]

This would have saved the ratios and put the numbers into the range of 0 to 1.

What am I doing wrong or misunderstanding about MinMaxScaler here? Thinking of use cases like training on time series, it makes no sense to transform important numbers like prices or temperatures into distorted values like the ones above.


Solution

  • MinMaxScaler scales and translates each feature individually into the given range, using the following formula from the documentation. So your issue comes down to the formula used.

    Formula:

    X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    X_scaled = X_std * (max - min) + min
    

    Let us reproduce what it does on your data. You need NumPy for this.

    import numpy as np
    
    data = np.array([[1, 2], [3, 4], [4, 5]])
    
    # the target range comes from the feature_range you specify;
    # named range_min/range_max so Python's built-in min/max are not shadowed
    range_min = 0
    range_max = 1
    
    # column-wise (axis=0) min-max scaling
    X_std = (data - np.min(data, axis=0)) / (np.max(data, axis=0) - np.min(data, axis=0))
    
    X_scaled = X_std * (range_max - range_min) + range_min
    

    This returns as expected:

    array([[0.        , 0.        ],
           [0.66666667, 0.66666667],
           [1.        , 1.        ]])
    
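    To double-check, here is a short sketch (on the same toy data) comparing the manual computation with the scaler's own `fit_transform`; the two should agree exactly:

    ```python
    from sklearn.preprocessing import MinMaxScaler
    import numpy as np

    data = np.array([[1, 2], [3, 4], [4, 5]])

    # manual per-column (axis=0) min-max scaling
    col_min = data.min(axis=0)  # [1, 2]
    col_max = data.max(axis=0)  # [4, 5]
    manual = (data - col_min) / (col_max - col_min)

    # the scaler computes the same thing
    scaled = MinMaxScaler().fit_transform(data)
    print(np.allclose(manual, scaled))  # True
    ```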

    As for your doubts about MinMaxScaler: you could use StandardScaler instead if you have values that differ considerably from the rest but are still valid data.

    StandardScaler is used the same way as MinMaxScaler, but it scales each feature to have mean 0 and standard deviation 1. Because the mean and standard deviation are computed from all values in a column rather than just the two extremes, a single extreme point distorts them less than it distorts the min and max; for data with strong outliers, though, scikit-learn's RobustScaler (based on the median and interquartile range) is the more robust choice.
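    A minimal sketch of StandardScaler on the same toy data as above (assuming only NumPy and scikit-learn are installed):

    ```python
    from sklearn.preprocessing import StandardScaler
    import numpy as np

    data = np.array([[1, 2], [3, 4], [4, 5]])

    scaler = StandardScaler()
    scaled = scaler.fit_transform(data)

    # each column now has mean ~0 and (population) standard deviation ~1
    print(scaled.mean(axis=0))
    print(scaled.std(axis=0))
    ```

    Note that unlike MinMaxScaler, the output is not bounded to a fixed interval; the values are expressed in units of standard deviations from the column mean.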