I am trying to understand the output generated by a logistic regression model in PySpark.
Could anyone please explain how the rawPrediction field produced by a logistic regression model is calculated?
In older versions of the Spark javadocs (e.g. 1.5.x), there used to be the following explanation:
The meaning of a "raw" prediction may vary between algorithms, but it intuitively gives a measure of confidence in each possible label (where larger = more confident).
It is not there in the later versions, but you can still find it in the Scala source code.
Anyway, and any unfortunate wording aside, the rawPrediction in Spark ML, for the logistic regression case, is what the rest of the world calls logits, i.e. the raw output of a logistic regression classifier, which is subsequently transformed into a probability score using the logistic function exp(x)/(1+exp(x)).
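To make "logit" concrete: for a binary model, the logit is the linear score w·x + b before the logistic function is applied. The following is a minimal NumPy sketch, not Spark's own code. The weights `w`, the intercept `b`, and the helper name `raw_prediction` are made up for illustration, and the `[-margin, margin]` layout is my understanding of how the binomial case fills the two-element vector (it is consistent with the symmetric values in the result table of the toy example).

```python
import numpy as np

# Hypothetical parameters, chosen only for illustration (not from a fitted model).
w = np.array([1.5, -2.0])   # one weight per feature
b = 0.25                    # intercept

def raw_prediction(x):
    """Return the two-class logit vector [-margin, margin] for features x."""
    margin = float(np.dot(w, x) + b)   # linear score, no logistic applied yet
    return np.array([-margin, margin])

raw = raw_prediction(np.array([0.2, 0.5]))
print(raw)
```

The two entries always have equal magnitude and opposite sign, one per class.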
Here is an example with toy data:
spark.version
# u'2.2.0'
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row
df = sqlContext.createDataFrame([
    (0.0, Vectors.dense(0.0, 1.0)),
    (1.0, Vectors.dense(1.0, 0.0))],
    ["label", "features"])
# +-----+---------+
# |label| features|
# +-----+---------+
# | 0.0|[0.0,1.0]|
# | 1.0|[1.0,0.0]|
# +-----+---------+
lr = LogisticRegression(maxIter=5, regParam=0.01, labelCol="label")
lr_model = lr.fit(df)
test = sc.parallelize([Row(features=Vectors.dense(0.2, 0.5)),
                       Row(features=Vectors.dense(0.5, 0.2))]).toDF()
lr_result = lr_model.transform(test)
lr_result.show(truncate=False)
Here is the result:
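+---------+----------------------------------------+----------------------------------------+----------+
|features |rawPrediction                           |probability                             |prediction|
+---------+----------------------------------------+----------------------------------------+----------+
|[0.2,0.5]|[0.9894187891647654,-0.9894187891647654]|[0.7289731070426124,0.27102689295738763]|0.0       |
|[0.5,0.2]|[-0.9894187891647683,0.9894187891647683]|[0.2710268929573871,0.728973107042613]  |1.0       |
+---------+----------------------------------------+----------------------------------------+----------+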
Let's now confirm that the logistic function of rawPrediction gives the probability column:
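import numpy as np

# rawPrediction of the first test row
x1 = np.array([0.9894187891647654,-0.9894187891647654])
np.exp(x1)/(1+np.exp(x1))
# array([ 0.72897311, 0.27102689])

# rawPrediction of the second test row
x2 = np.array([-0.9894187891647683,0.9894187891647683])
np.exp(x2)/(1+np.exp(x2))
# array([ 0.27102689, 0.72897311])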
i.e. this is indeed the case.
So, to summarize regarding all three (3) output columns:
rawPrediction is the raw output of the logistic regression classifier (array with length equal to the number of classes)
probability is the result of applying the logistic function to rawPrediction (array of length equal to that of rawPrediction)
prediction is the argument where the array probability takes its maximum value, and it gives the most probable label (single number)
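The chain from rawPrediction to probability to prediction can be sketched outside Spark with plain NumPy. This is only an illustration of the relationship described above, using the rawPrediction values copied from the result table; the variable names are my own, and the element-wise logistic step assumes the binary case shown here.

```python
import numpy as np

# rawPrediction values copied from the two rows of the result table.
raw = np.array([
    [0.9894187891647654, -0.9894187891647654],
    [-0.9894187891647683, 0.9894187891647683],
])

# probability: logistic function applied element-wise to rawPrediction.
probability = np.exp(raw) / (1.0 + np.exp(raw))

# prediction: index of the largest probability in each row, as a float label.
prediction = probability.argmax(axis=1).astype(float)

print(probability)
print(prediction)
```

Each row of `probability` sums to 1, and `prediction` picks the position of the larger entry, which matches the 0.0 and 1.0 labels in the table.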