Search code examples
pythonapache-sparkpysparkapache-spark-mllib

pyspark, logistic regression, how to get coefficient of respective features


I am new to Spark, my current version is 1.3.1. And I want to implement logistic regression with PySpark, so, I found this example from Spark Python MLlib

from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint
from numpy import array

# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("data/mllib/sample_svm_data.txt")
parsedData = data.map(parsePoint)

# Build the model
model = LogisticRegressionWithLBFGS.train(parsedData)

# Evaluating the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))

And I found the attributes of model are:

In [21]: model.<TAB>
model.clearThreshold  model.predict         model.weights
model.intercept       model.setThreshold  

How can I get the coefficients of logistic regression?


Solution

  • As you noticed the way to obtain the coefficients is by using LogisticRegressionModel's attributes.

    Parameters:

    weights – Weights computed for every feature.

    intercept – Intercept computed for this model. (Only used in Binary Logistic Regression. In Multinomial Logistic Regression, the intercepts will not be a single value, so the intercepts will be part of the weights.)

    numFeatures – the dimension of the features.

    numClasses – the number of possible outcomes for k classes classification problem in Multinomial Logistic Regression. By default, it is binary logistic regression so numClasses will be set to 2.

    Don't forget that hθ(x) = 1 / exp ^ -(θ0 + θ1 * x1 + ... + θn * xn) where θ0 represents the intercept, [θ1,...,θn] the weights, and the number of features is n.

    Edit

    As you can see this is the way how the prediction is done, you can check LogisticRegressionModel's source.

    def predict(self, x):
        """
        Predict values for a single data point or an RDD of points
        using the model trained.
        """
        if isinstance(x, RDD):
            return x.map(lambda v: self.predict(v))
    
        x = _convert_to_vector(x)
        if self.numClasses == 2:
            margin = self.weights.dot(x) + self._intercept
            if margin > 0:
                prob = 1 / (1 + exp(-margin))
            else:
                exp_margin = exp(margin)
                prob = exp_margin / (1 + exp_margin)
            if self._threshold is None:
                return prob
            else:
                return 1 if prob > self._threshold else 0
        else:
            best_class = 0
            max_margin = 0.0
            if x.size + 1 == self._dataWithBiasSize:
                for i in range(0, self._numClasses - 1):
                    margin = x.dot(self._weightsMatrix[i][0:x.size]) + \
                        self._weightsMatrix[i][x.size]
                    if margin > max_margin:
                        max_margin = margin
                        best_class = i + 1
            else:
                for i in range(0, self._numClasses - 1):
                    margin = x.dot(self._weightsMatrix[i])
                    if margin > max_margin:
                        max_margin = margin
                        best_class = i + 1
            return best_class