I was doing some scaling on the dataset below using Spark MLlib:
+---+--------------+
| id| features|
+---+--------------+
| 0|[1.0,0.1,-1.0]|
| 1| [2.0,1.1,1.0]|
| 0|[1.0,0.1,-1.0]|
| 1| [2.0,1.1,1.0]|
| 1|[3.0,10.1,3.0]|
+---+--------------+
You can find this dataset at https://github.com/databricks/Spark-The-Definitive-Guide/blob/master/data/simple-ml-scaling/part-00000-cd03406a-cc9b-42b0-9299-1e259fdd9382-c000.gz.parquet
After performing standard scaling, I get the following result:
+---+--------------+------------------------------------------------------------+
|id |features |stdScal_06f7a85f98ef__output |
+---+--------------+------------------------------------------------------------+
|0 |[1.0,0.1,-1.0]|[1.1952286093343936,0.02337622911060922,-0.5976143046671968]|
|1 |[2.0,1.1,1.0] |[2.390457218668787,0.2571385202167014,0.5976143046671968] |
|0 |[1.0,0.1,-1.0]|[1.1952286093343936,0.02337622911060922,-0.5976143046671968]|
|1 |[2.0,1.1,1.0] |[2.390457218668787,0.2571385202167014,0.5976143046671968] |
|1 |[3.0,10.1,3.0]|[3.5856858280031805,2.3609991401715313,1.7928429140015902] |
+---+--------------+------------------------------------------------------------+
If I perform min/max scaling instead, setting val minMax = new MinMaxScaler().setMin(5).setMax(10).setInputCol("features"), I get the following:
+---+--------------+-------------------------------+
| id| features|minMaxScal_21493d63e2bf__output|
+---+--------------+-------------------------------+
| 0|[1.0,0.1,-1.0]| [5.0,5.0,5.0]|
| 1| [2.0,1.1,1.0]| [7.5,5.5,7.5]|
| 0|[1.0,0.1,-1.0]| [5.0,5.0,5.0]|
| 1| [2.0,1.1,1.0]| [7.5,5.5,7.5]|
| 1|[3.0,10.1,3.0]| [10.0,10.0,10.0]|
+---+--------------+-------------------------------+
Please find the code below:
// loading dataset
val scaleDF = spark.read.parquet("/data/simple-ml-scaling")
// using standardScaler
import org.apache.spark.ml.feature.StandardScaler
val ss = new StandardScaler().setInputCol("features")
ss.fit(scaleDF).transform(scaleDF).show(false)
// using min/max scaler
import org.apache.spark.ml.feature.MinMaxScaler
val minMax = new MinMaxScaler().setMin(5).setMax(10).setInputCol("features")
val fittedminMax = minMax.fit(scaleDF)
fittedminMax.transform(scaleDF).show()
I know the formulas for standardization and min/max scaling, but I cannot understand how the values in the third column are arrived at. Please help me understand the math behind them.
MinMaxScaler in Spark works on each feature individually. From the documentation we have:
Rescale each feature individually to a common range [min, max] linearly using column summary statistics, which is also known as min-max normalization or Rescaling.
$$ Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min $$
[...]
So each column in the features array will be scaled separately.
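To see this concretely, the same numbers can be reproduced outside Spark with a few lines of plain Scala (just an illustrative sketch, not how MinMaxScaler is implemented internally; the column values are copied from the dataset above and rescale is an ad-hoc helper, not part of the MinMaxScaler API):

// each Seq holds one component of the features vectors across all five rows
val columns = Seq(
  Seq(1.0, 2.0, 1.0, 2.0, 3.0),   // first component
  Seq(0.1, 1.1, 0.1, 1.1, 10.1),  // second component
  Seq(-1.0, 1.0, -1.0, 1.0, 3.0)  // third component
)
val (newMin, newMax) = (5.0, 10.0) // corresponds to setMin(5) / setMax(10)

// apply Rescaled(e_i) = (e_i - E_min) / (E_max - E_min) * (max - min) + min to one column
def rescale(col: Seq[Double]): Seq[Double] = {
  val (eMin, eMax) = (col.min, col.max) // column summary statistics
  col.map(e => (e - eMin) / (eMax - eMin) * (newMax - newMin) + newMin)
}

columns.map(rescale).foreach(println)
// List(5.0, 7.5, 5.0, 7.5, 10.0)
// List(5.0, 5.5, 5.0, 5.5, 10.0)
// List(5.0, 7.5, 5.0, 7.5, 10.0)

Reading these three lists column-wise gives exactly the rows of minMaxScal_21493d63e2bf__output shown above.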
In this case, the MinMaxScaler is set to have a minimum value of 5 and a maximum value of 10.
The calculation for each column will thus be: