While working through some ML material I got confused by the MinMaxScaler provided by sklearn. The goal is to normalize numerical data into the range [0, 1].
Example code:
from sklearn.preprocessing import MinMaxScaler
data = [[1, 2], [3, 4], [4, 5]]
scaler = MinMaxScaler(feature_range=(0, 1))
scaledData = scaler.fit_transform(data)
This gives the output:
[[0.         0.        ]
 [0.66666667 0.66666667]
 [1.         1.        ]]
The first row [1, 2] got transformed into [0, 0], which in my eyes throws away the relationship between the original values. Here is what I expected instead:
[[0.1, 0.2]
[0.3, 0.4]
[0.4, 0.5]]
This would have preserved the ratios while still putting the numbers into the range of 0 to 1.
What am I doing wrong or misunderstanding about MinMaxScaler here? Thinking of use cases like training on time series, it makes no sense to me to transform important numbers such as prices or temperatures into distorted values like the ones above.
According to the documentation, MinMaxScaler scales and translates each feature individually so that it lies within the given range, using the formula below. So your issue comes down to the formula that is used.
Formula:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
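To see why [1, 2] ends up as [0, 0], you can work the formula out by hand for each column (this is just the arithmetic behind the formula above, nothing more): the minimum of a column is subtracted first, so the smallest value in every column always maps to 0.
Column 1 (values 1, 3, 4): min = 1, max = 4
  (1 - 1) / (4 - 1) = 0
  (3 - 1) / (4 - 1) = 0.667
  (4 - 1) / (4 - 1) = 1
Column 2 (values 2, 4, 5): min = 2, max = 5
  (2 - 2) / (5 - 2) = 0
  (4 - 2) / (5 - 2) = 0.667
  (5 - 2) / (5 - 2) = 1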
Let us see what happens when this is applied to your data. You need numpy to reproduce the computation by hand.
from sklearn.preprocessing import MinMaxScaler
import numpy as np

data = np.array([[1, 2], [3, 4], [4, 5]])

scaler = MinMaxScaler()  # feature_range defaults to (0, 1); not used below, we reproduce its formula by hand

# the bounds of the target range come from the feature_range you specify
range_min, range_max = 0, 1

# apply the documented formula column by column (axis=0)
X_std = (data - np.min(data, axis=0)) / (np.max(data, axis=0) - np.min(data, axis=0))
X_scaled = X_std * (range_max - range_min) + range_min
This returns, as expected:
array([[0.        , 0.        ],
       [0.66666667, 0.66666667],
       [1.        , 1.        ]])
As for your doubts about MinMaxScaler: if you have outliers that differ strongly from most of the values but are still valid data, you could use StandardScaler instead.
StandardScaler is used the same way as MinMaxScaler, but it shifts and scales each feature so that it has mean 0 and standard deviation 1. Because the mean and standard deviation are computed from all the values rather than just the two extremes, the result is less dominated by a single extreme value than min/max scaling is.
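For comparison, here is a minimal sketch of StandardScaler on the same data; the commented numbers are my own rough computation from the column means and standard deviations, so treat them as approximate:
from sklearn.preprocessing import StandardScaler

data = [[1, 2], [3, 4], [4, 5]]

scaler = StandardScaler()
print(scaler.fit_transform(data))
# each column now has mean 0 and standard deviation 1, approximately:
# [[-1.336 -1.336]
#  [ 0.267  0.267]
#  [ 1.069  1.069]]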