python · scikit-learn · data-science · normalizing

Weird output of MinMaxScaler


While working my way through learning ML, I am confused by the MinMaxScaler provided by sklearn. The goal is to normalize numerical data into the range [0, 1].

Example code:

from sklearn.preprocessing import MinMaxScaler

data = [[1, 2], [3, 4], [4, 5]]
scaler = MinMaxScaler(feature_range=(0, 1))
scaledData = scaler.fit_transform(data)

Giving output:

[[0.         0.        ]
 [0.66666667 0.66666667]
 [1.         1.        ]]

The first array [1, 2] got transformed into [0, 0] which in my eyes means:

  • The ratio between the numbers is gone
  • Neither value carries any information anymore, as both were set to the minimum value (0).

Example of what I have expected:

[[0.1, 0.2]
 [0.3, 0.4]
 [0.4, 0.5]]

This would have saved the ratios and put the numbers into the range of 0 to 1.

What am I doing wrong or misunderstanding about MinMaxScaler here? Thinking of use cases like training on time series, it makes no sense to transform important numbers like prices or temperatures into distorted values like the ones above.


Solution

  • MinMaxScaler scales and translates each feature individually into the given range, using the following formula from the documentation. So your issue comes down to the formula used.

    Formula:

    X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    X_scaled = X_std * (max - min) + min
    

    Let us reproduce what it does on your data. You need NumPy for this.

    import numpy as np
    
    data = np.array([[1, 2], [3, 4], [4, 5]])
    
    # the target range comes from the feature_range you specify;
    # named range_min/range_max so Python's built-in min/max are not shadowed
    range_min = 0
    range_max = 1
    
    # column-wise (axis=0) min-max scaling
    X_std = (data - np.min(data, axis=0)) / (np.max(data, axis=0) - np.min(data, axis=0))
    
    X_scaled = X_std * (range_max - range_min) + range_min
    

    This returns as expected:

    array([[0.        , 0.        ],
           [0.66666667, 0.66666667],
           [1.        , 1.        ]])
    
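    To double-check, here is a short sketch (on the same toy data) comparing the manual computation with the scaler's own `fit_transform`; the two should agree exactly:

    ```python
    from sklearn.preprocessing import MinMaxScaler
    import numpy as np

    data = np.array([[1, 2], [3, 4], [4, 5]])

    # manual per-column (axis=0) min-max scaling
    col_min = data.min(axis=0)  # [1, 2]
    col_max = data.max(axis=0)  # [4, 5]
    manual = (data - col_min) / (col_max - col_min)

    # the scaler computes the same thing
    scaled = MinMaxScaler().fit_transform(data)
    print(np.allclose(manual, scaled))  # True
    ```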

    As for your doubts about MinMaxScaler: you could use StandardScaler instead if you have values that differ considerably from the rest but are still valid data.

    StandardScaler is used the same way as MinMaxScaler, but it scales each feature to have mean 0 and standard deviation 1. Because the mean and standard deviation are computed from all values in a column rather than just the two extremes, a single extreme point distorts them less than it distorts the min and max; for data with strong outliers, though, scikit-learn's RobustScaler (based on the median and interquartile range) is the more robust choice.
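    A minimal sketch of StandardScaler on the same toy data as above (assuming only NumPy and scikit-learn are installed):

    ```python
    from sklearn.preprocessing import StandardScaler
    import numpy as np

    data = np.array([[1, 2], [3, 4], [4, 5]])

    scaler = StandardScaler()
    scaled = scaler.fit_transform(data)

    # each column now has mean ~0 and (population) standard deviation ~1
    print(scaled.mean(axis=0))
    print(scaled.std(axis=0))
    ```

    Note that unlike MinMaxScaler, the output is not bounded to a fixed interval; the values are expressed in units of standard deviations from the column mean.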