I was reading "Spark The Definitive Guide", i came across a code section in MLlib chapter which has the following code:
var df = spark.read.json("/data/simple-ml")
df.orderBy("value2").show()
import org.apache.spark.ml.feature.RFormula
// Unable to understand the interpretation of this formula
val supervised = new RFormula().setFormula("lab ~ . + color:value1 + color:value2")
val fittedRF = supervised.fit(df)
val preparedDF = fittedRF.transform(df)
preparedDF.show()
Where /data/simple-ml contains a JSON-lines file with records such as:

{"lab":"good","color":"green","value1":1,"value2":14.386294994851129}
{"lab":"bad","color":"blue","value1":8,"value2":14.386294994851129}
{"lab":"bad","color":"blue","value1":12,"value2":14.386294994851129}
{"lab":"good","color":"green","value1":15,"value2":38.97187133755819}
You can find the complete data set at https://github.com/databricks/Spark-The-Definitive-Guide/blob/master/data/simple-ml/part-r-00000-f5c243b9-a015-4a3b-a4a8-eca00f80f04c.json. The lines above produce the following output:
[green,good,1,14.386294994851129,(10,[0,2,3,4,7],[1.0,1.0,14.386294994851129,1.0,14.386294994851129]),0.0]
[blue,bad,8,14.386294994851129,(10,[2,3,6,9],[8.0,14.386294994851129,8.0,14.386294994851129]),1.0]
[blue,bad,12,14.386294994851129,(10,[2,3,6,9],[12.0,14.386294994851129,12.0,14.386294994851129]),1.0]
[green,good,15,38.97187133755819,(10,[0,2,3,4,7],[1.0,15.0,38.97187133755819,15.0,38.97187133755819]),0.0]
Now I am not able to understand how the value in the 5th column (the features vector) is calculated.
The 5th column is a structure representing sparse vectors in Spark. It has three components:

- the length of the vector
- an array with the indices of the non-zero elements
- an array with the values at those indices
So

(10,[0,2,3,4,7],[1.0,1.0,14.386294994851129,1.0,14.386294994851129])

represents the following sparse vector of length 10 (take the i-th value from the values array and place it at the position given by the i-th index):

[1.0, 0, 1.0, 14.386294994851129, 1.0, 0, 0, 14.386294994851129, 0, 0]
 ^0      ^2   ^3                  ^4         ^7

(the markers show the positions of the non-zero elements)
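As a quick check, here is a minimal sketch that builds the same sparse vector with Spark's own API and densifies it (assuming spark-mllib is on the classpath):

import org.apache.spark.ml.linalg.Vectors

// Sparse vector = (size, indices of non-zero entries, their values)
val sv = Vectors.sparse(
  10,
  Array(0, 2, 3, 4, 7),
  Array(1.0, 1.0, 14.386294994851129, 1.0, 14.386294994851129))

// Expands to the dense form shown above
println(sv.toDense)
// [1.0,0.0,1.0,14.386294994851129,1.0,0.0,0.0,14.386294994851129,0.0,0.0]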
What are the individual components of that vector? According to the documentation:
RFormula produces a vector column of features and a double or string column of label. Like when formulas are used in R for linear regression, string input columns will be one-hot encoded, and numeric columns will be cast to doubles. If the label column is of type string, it will be first transformed to double with StringIndexer. If the label column does not exist in the DataFrame, the output label column will be created from the specified response variable in the formula.
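For illustration, this is roughly the label transformation RFormula performs internally for a string label (a sketch, not the exact internal code; the output column name is chosen here for the example):

import org.apache.spark.ml.feature.StringIndexer

// Index the string label column into a numeric one, as RFormula does
val indexer = new StringIndexer()
  .setInputCol("lab")
  .setOutputCol("label")

indexer.fit(df).transform(df).select("lab", "label").distinct().show()
// good -> 0.0, bad -> 1.0 (StringIndexer assigns 0.0 to the most
// frequent value, so the mapping can differ on other data sets)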
lab ~ . + color:value1 + color:value2 is a special syntax that comes from the R language. It describes a model that regresses the value of lab on all the other features plus two interaction (product) terms. You can see the list of all features by printing fittedRF and looking at the ResolvedRFormula instance it contains:
scala> println(fittedRF)
RFormulaModel(
ResolvedRFormula(
label=lab,
terms=[color,value1,value2,{color,value1},{color,value2}],
hasIntercept=true
)
) (uid=rFormula_0847e597e817)
I've split the output into lines and indented it for readability. So . + color:value1 + color:value2 expands to [color,value1,value2,{color,value1},{color,value2}]. Of those, color is a categorical feature and it gets one-hot encoded into a set of indicator features using the following mapping:
green -> [1, 0]
blue  -> [0, 0]
red   -> [0, 1]
Although you have three categories, only two are used for the encoding. Blue in this case gets dropped since its presence carries no additional information: if it were there, all three columns would always sum to 1, which makes them linearly dependent. The effect of dropping the blue category is that it becomes the baseline as part of the intercept, and the fitted model predicts what effect changing the category from blue to green or from blue to red will have on the label. That particular choice of encoding is a bit arbitrary - on my system the columns for red and green came out swapped.
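You can verify the mapping on your own copy of the data with a color-only formula (a quick sketch; the left-hand side of the formula only matters for the label column):

import org.apache.spark.ml.feature.RFormula

// Encode just the color column to inspect the one-hot mapping
val colorOnly = new RFormula().setFormula("lab ~ color")
colorOnly.fit(df).transform(df).select("color", "features").show(5)
// expect green = [1,0], red = [0,1], blue = [0,0], displayed by
// show() in sparse form; red and green may come out swapped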
value1 and value2 are doubles, so they go unchanged into the feature vector. {color,value1} is the product of the color feature and the value1 feature, i.e. the product of the one-hot encoding of color with the scalar value1, resulting in three new features. Notice that in this case we cannot drop one category, because the interaction makes the "base" value dependent on the value of the second feature in the interaction. The same goes for {color,value2}. So you end up with 2 + 1 + 1 + 3 + 3 = 10 features in total. What you see in the output of show() is the assembled vector feature column that can be used as input by other Spark ML classes.
Here is how to read the first row:
(10,[0,2,3,4,7],[1.0,1.0,14.386294994851129,1.0,14.386294994851129])
is the sparse vector representation of
[1.0, 0, 1.0, 14.386294994851129, 1.0, 0, 0, 14.386294994851129, 0, 0]
 |-1--|  |2|  |-------3--------|  |---4---|  |----------5-----------|
which contains the following individual components:

- [1.0, 0, ...] - color, the one-hot encoding (minus the linearly dependent third category) of the category green
- [..., 1.0, ...] - value1, the value 1
- [..., 14.386294994851129, ...] - value2, the value 14.38629...
- [..., 1.0, 0, 0, ...] - the color x value1 interaction term, the product of the one-hot encoding of green ([1, 0, 0]) and 1
- [..., 14.386294994851129, 0, 0] - the color x value2 interaction term, the product of the one-hot encoding of green ([1, 0, 0]) and 14.38629...
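To tie it all together, here is a small sketch that rebuilds the first row's feature vector by hand from the raw values (plain Scala, no Spark needed; the one-hot column order is the one observed above):

// First row: color = green, value1 = 1, value2 = 14.386294994851129
val colorOneHot = Array(1.0, 0.0)        // green, with blue dropped
val colorFull   = Array(1.0, 0.0, 0.0)   // full encoding used in interactions
val value1 = 1.0
val value2 = 14.386294994851129

val features =
  colorOneHot ++ Array(value1, value2) ++  // color, value1, value2
  colorFull.map(_ * value1) ++             // color:value1 interaction
  colorFull.map(_ * value2)                // color:value2 interaction

println(features.mkString("[", ", ", "]"))
// [1.0, 0.0, 1.0, 14.386294994851129, 1.0, 0.0, 0.0, 14.386294994851129, 0.0, 0.0]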