I was doing some scaling on the dataset below using Spark MLlib:
+---+--------------+
| id| features|
+---+--------------+
| 0|[1.0,0.1,-1.0]|
| 1| [2.0,1.1,1.0]|
| 0|[1.0,0.1,-1.0]|
| 1| [2.0,1.1,1.0]|
| 1|[3.0,10.1,3.0]|
+---+--------------+
You can find this dataset at https://github.com/databricks/Spark-The-Definitive-Guide/blob/master/data/simple-ml-scaling/part-00000-cd03406a-cc9b-42b0-9299-1e259fdd9382-c000.gz.parquet
After performing standard scaling, I get the following result:
+---+--------------+------------------------------------------------------------+
|id |features |stdScal_06f7a85f98ef__output |
+---+--------------+------------------------------------------------------------+
|0 |[1.0,0.1,-1.0]|[1.1952286093343936,0.02337622911060922,-0.5976143046671968]|
|1 |[2.0,1.1,1.0] |[2.390457218668787,0.2571385202167014,0.5976143046671968] |
|0 |[1.0,0.1,-1.0]|[1.1952286093343936,0.02337622911060922,-0.5976143046671968]|
|1 |[2.0,1.1,1.0] |[2.390457218668787,0.2571385202167014,0.5976143046671968] |
|1 |[3.0,10.1,3.0]|[3.5856858280031805,2.3609991401715313,1.7928429140015902] |
+---+--------------+------------------------------------------------------------+
If I perform min/max scaling instead, setting val minMax = new MinMaxScaler().setMin(5).setMax(10).setInputCol("features"), I get the following:
+---+--------------+-------------------------------+
| id| features|minMaxScal_21493d63e2bf__output|
+---+--------------+-------------------------------+
| 0|[1.0,0.1,-1.0]| [5.0,5.0,5.0]|
| 1| [2.0,1.1,1.0]| [7.5,5.5,7.5]|
| 0|[1.0,0.1,-1.0]| [5.0,5.0,5.0]|
| 1| [2.0,1.1,1.0]| [7.5,5.5,7.5]|
| 1|[3.0,10.1,3.0]| [10.0,10.0,10.0]|
+---+--------------+-------------------------------+
Please find the code below:
// loading dataset
val scaleDF = spark.read.parquet("/data/simple-ml-scaling")
// using standardScaler
import org.apache.spark.ml.feature.StandardScaler
val ss = new StandardScaler().setInputCol("features")
ss.fit(scaleDF).transform(scaleDF).show(false)
// using min/max scaler
import org.apache.spark.ml.feature.MinMaxScaler
val minMax = new MinMaxScaler().setMin(5).setMax(10).setInputCol("features")
val fittedminMax = minMax.fit(scaleDF)
fittedminMax.transform(scaleDF).show()
I know the formulas for standardization and min/max scaling, but I cannot understand how the values in the third column are arrived at. Please help me understand the math behind them.
MinMaxScaler in Spark works on each feature individually. From the documentation we have:
Rescale each feature individually to a common range [min, max] linearly using column summary statistics, which is also known as min-max normalization or Rescaling.
$$ Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min $$
[...]
So each column in the features array will be scaled separately.
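To see this concretely, the same numbers can be reproduced outside Spark with a few lines of plain Scala (just an illustrative sketch, not how MinMaxScaler is implemented internally; the column values are copied from the dataset above and rescale is an ad-hoc helper, not part of the MinMaxScaler API):

// each Seq holds one component of the features vectors across all five rows
val columns = Seq(
  Seq(1.0, 2.0, 1.0, 2.0, 3.0),   // first component
  Seq(0.1, 1.1, 0.1, 1.1, 10.1),  // second component
  Seq(-1.0, 1.0, -1.0, 1.0, 3.0)  // third component
)
val (newMin, newMax) = (5.0, 10.0) // corresponds to setMin(5) / setMax(10)

// apply Rescaled(e_i) = (e_i - E_min) / (E_max - E_min) * (max - min) + min to one column
def rescale(col: Seq[Double]): Seq[Double] = {
  val (eMin, eMax) = (col.min, col.max) // column summary statistics
  col.map(e => (e - eMin) / (eMax - eMin) * (newMax - newMin) + newMin)
}

columns.map(rescale).foreach(println)
// List(5.0, 7.5, 5.0, 7.5, 10.0)
// List(5.0, 5.5, 5.0, 5.5, 10.0)
// List(5.0, 7.5, 5.0, 7.5, 10.0)

Reading these three lists column-wise gives exactly the rows of minMaxScal_21493d63e2bf__output shown above.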
In this case, the MinMaxScaler is set to have a minimum value of 5 and a maximum value of 10.
The calculation for each column will thus be: