Tags: apache-spark, machine-learning, classification, apache-spark-mllib

Spark RFormula Interpretation


I was reading "Spark: The Definitive Guide" and came across a code section in the MLlib chapter with the following code:

var df = spark.read.json("/data/simple-ml") 
df.orderBy("value2").show()
import org.apache.spark.ml.feature.RFormula
// Unable to understand the interpretation of this formula
val supervised = new RFormula().setFormula("lab ~ . + color:value1 + color:value2")
val fittedRF = supervised.fit(df)
val preparedDF = fittedRF.transform(df) 
preparedDF.show()

where /data/simple-ml contains a JSON file with records such as:

"lab":"good","color":"green","value1":1,"value2":14.386294994851129 "lab":"bad","color":"blue","value1":8,"value2":14.386294994851129 "lab":"bad","color":"blue","value1":12,"value2":14.386294994851129 "lab":"good","color":"green","value1":15,"value2":38.9718713375581

You can find the complete data set at https://github.com/databricks/Spark-The-Definitive-Guide/blob/master/data/simple-ml/part-r-00000-f5c243b9-a015-4a3b-a4a8-eca00f80f04c.json, and the lines above produce the following output:

[green,good,1,14.386294994851129,(10,[0,2,3,4,7],[1.0,1.0,14.386294994851129,1.0,14.386294994851129]),0.0]
[blue,bad,8,14.386294994851129,(10,[2,3,6,9],[8.0,14.386294994851129,8.0,14.386294994851129]),1.0]
[blue,bad,12,14.386294994851129,(10,[2,3,6,9],[12.0,14.386294994851129,12.0,14.386294994851129]),1.0]
[green,good,15,38.97187133755819,(10,[0,2,3,4,7],[1.0,15.0,38.97187133755819,15.0,38.97187133755819]),0.0]

Now I am not able to understand how the fifth column value (the sparse vector) is calculated.


Solution

  • The fifth column is Spark's representation of a sparse vector. It has three components:

    • the vector length - here all vectors have length 10
    • an index array holding the positions of the non-zero elements
    • a value array holding the non-zero values

    So

    (10,[0,2,3,4,7],[1.0,1.0,14.386294994851129,1.0,14.386294994851129])
    

    represents the following sparse vector of length 10 (take the i-th value and place it at the position given by the i-th index):

     0       2    3                   4          7
    [1.0, 0, 1.0, 14.386294994851129, 1.0, 0, 0, 14.386294994851129, 0, 0]
    

    (the positions of the non-zero elements are shown)
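    The expansion from the sparse triplet to the dense array can be sketched in plain Scala. `toDense` below is a hypothetical helper written for illustration, not part of Spark's API (Spark's own `org.apache.spark.ml.linalg.SparseVector` provides the equivalent via `toDense`):

    ```scala
    // Sketch: expand Spark's (size, indices, values) sparse triplet into a
    // dense array. Positions not listed in `indices` stay 0.0.
    def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
      val dense = Array.fill(size)(0.0)
      indices.zip(values).foreach { case (i, v) => dense(i) = v }
      dense
    }

    val dense = toDense(10, Array(0, 2, 3, 4, 7),
      Array(1.0, 1.0, 14.386294994851129, 1.0, 14.386294994851129))
    println(dense.mkString("[", ", ", "]"))
    // [1.0, 0.0, 1.0, 14.386294994851129, 1.0, 0.0, 0.0, 14.386294994851129, 0.0, 0.0]
    ```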

    What are the individual components of that vector? According to the documentation:

    RFormula produces a vector column of features and a double or string column of label. Like when formulas are used in R for linear regression, string input columns will be one-hot encoded, and numeric columns will be cast to doubles. If the label column is of type string, it will be first transformed to double with StringIndexer. If the label column does not exist in the DataFrame, the output label column will be created from the specified response variable in the formula.
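    The last column in the output (0.0 / 1.0) comes from this StringIndexer step. Its default ordering is frequencyDesc - the most frequent label gets index 0. The sketch below illustrates that behaviour with a hypothetical `indexLabels` helper; the assumption is that good is the majority label in the full data set, which is why it maps to 0.0:

    ```scala
    // Sketch of StringIndexer's default frequencyDesc ordering: labels are
    // indexed by descending frequency, ties broken by the string value.
    def indexLabels(labels: Seq[String]): Map[String, Double] =
      labels.groupBy(identity).toSeq
        .sortBy { case (label, occ) => (-occ.size, label) } // freq desc, then name
        .zipWithIndex
        .map { case ((label, _), i) => label -> i.toDouble }
        .toMap

    val idx = indexLabels(Seq("good", "bad", "good", "bad", "good"))
    println(idx("good")) // 0.0
    println(idx("bad"))  // 1.0
    ```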

    lab ~ . + color:value1 + color:value2 is a special syntax that comes from the R language. It describes a model that regresses the value of lab on all the other features plus two interaction (product) terms. You can see the list of all features by printing fittedRF and looking at the ResolvedRFormula instance it contains:

    scala> println(fittedRF)
    RFormulaModel(
     ResolvedRFormula(
      label=lab,
      terms=[color,value1,value2,{color,value1},{color,value2}],
      hasIntercept=true
     )
    ) (uid=rFormula_0847e597e817)
    

    I've split the output in lines and indented it for readability. So . + color:value1 + color:value2 expands to [color,value1,value2,{color,value1},{color,value2}]. Of those, color is a categorical feature and it gets one-hot encoded in a set of indicator features using the following mapping:

    • green becomes [1, 0]
    • blue becomes [0, 0]
    • red becomes [0, 1]

    Although you have three categories, only two are used for the encoding. blue in this case gets dropped since its indicator carries no extra information - if it were included, the three columns would always sum to 1, making them linearly dependent. The effect of dropping the blue category is that it becomes the baseline, absorbed into the intercept, and the fitted model predicts what effect changing the category from blue to green or from blue to red has on the label. The particular choice of which categories to keep is somewhat arbitrary - on my system the columns for red and green came out swapped.
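    That drop-last behaviour can be sketched with a hypothetical `oneHot` helper (the order of the kept categories is an assumption here; as noted, it may differ between runs):

    ```scala
    // Sketch of drop-last one-hot encoding: k categories map to k-1 indicator
    // slots; the dropped baseline category ("blue" here) encodes as all zeros.
    val keptCategories = Seq("green", "red") // "blue" dropped as the baseline

    def oneHot(color: String): Array[Double] =
      keptCategories.map(c => if (c == color) 1.0 else 0.0).toArray

    println(oneHot("green").mkString("[", ", ", "]")) // [1.0, 0.0]
    println(oneHot("blue").mkString("[", ", ", "]"))  // [0.0, 0.0]
    println(oneHot("red").mkString("[", ", ", "]"))   // [0.0, 1.0]
    ```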

    value1 and value2 are doubles, so they go unchanged into the feature vector. {color,value1} is the interaction between the color feature and the value1 feature, i.e. the product of the one-hot encoding of color with the scalar value1, resulting in three new features. Notice that in this case we cannot drop one category, because the interaction makes the "base" value dependent on the value of the second feature in the interaction. The same holds for {color,value2}. So you end up with 2 + 1 + 1 + 3 + 3 = 10 features in total. What you see in the output of show() is the assembled feature vector column that can be used as input by other Spark ML classes.
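    Putting the pieces together, the 10-element vector for the first row (color=green, value1=1, value2=14.386...) can be reproduced by hand. This is a sketch under the category-order assumption above; note that the interaction terms use the full three-slot encoding of color:

    ```scala
    // Assemble the feature vector for the first row following the resolved
    // terms [color, value1, value2, {color,value1}, {color,value2}].
    val colorMain = Array(1.0, 0.0)      // green, drop-last encoding (2 slots)
    val colorFull = Array(1.0, 0.0, 0.0) // green over all 3 categories
    val value1 = 1.0
    val value2 = 14.386294994851129

    val features = colorMain ++ Array(value1, value2) ++
      colorFull.map(_ * value1) ++ colorFull.map(_ * value2)

    println(features.length) // 10
    println(features.mkString("[", ", ", "]"))
    // [1.0, 0.0, 1.0, 14.386294994851129, 1.0, 0.0, 0.0, 14.386294994851129, 0.0, 0.0]
    ```

    This matches the dense form of the sparse vector shown in the first output row.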

    Here is how to read the first row:

    (10,[0,2,3,4,7],[1.0,1.0,14.386294994851129,1.0,14.386294994851129])
    

    is the sparse vector representation of

    [1.0, 0, 1.0, 14.386294994851129, 1.0, 0, 0, 14.386294994851129, 0, 0]
     |--1--| |2|  |-------3--------|  |---4---|  |----------5-----------|
    

    which contains the following individual components:

    1. [1.0, 0, ...] - color, one-hot encoding (minus the linearly dependent third category) of category green
    2. [..., 1.0, ...] - value1, value 1
    3. [..., 14.386294994851129, ...] - value2, value 14.38629...
    4. [..., 1.0, 0, 0, ...] - color x value1 interaction term, product of one-hot encoding of green ([1, 0, 0]) and 1
    5. [..., 14.386294994851129, 0, 0] - color x value2 interaction term, product of one-hot encoding of green ([1, 0, 0]) and 14.38629...