python apache-spark pyspark apache-spark-mllib

PySpark logistic regression: how to get the coefficients of the respective features

I am new to Spark; my current version is 1.3.1. I want to implement logistic regression with PySpark, so I found this example for Spark Python MLlib:

from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint
from numpy import array

# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("data/mllib/sample_svm_data.txt")
parsedData = data.map(parsePoint)

# Build the model
model = LogisticRegressionWithLBFGS.train(parsedData)

# Evaluating the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))

I found that the model has these attributes:

In [21]: model.<TAB>
model.clearThreshold  model.predict         model.weights
model.intercept       model.setThreshold

How can I get the coefficients of logistic regression?

Solution

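As you noticed, the coefficients are available through the LogisticRegressionModel's attributes: model.weights holds one coefficient per feature, in the same order as the feature vector, and model.intercept holds the intercept.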

Parameters:

weights – Weights computed for every feature.

intercept – Intercept computed for this model. (Only used in Binary Logistic Regression. In Multinomial Logistic Regression, the intercepts will not be a single value, so the intercepts will be part of the weights.)

numFeatures – the dimension of the features.

numClasses – the number of possible outcomes for k classes classification problem in Multinomial Logistic Regression. By default, it is binary logistic regression so numClasses will be set to 2.
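To line the coefficients up with feature names, one option is to pair each entry of the weight vector with a label. The sketch below assumes the weights have been converted to a plain sequence (for example with model.weights.toArray() on the trained model) and that you supply the feature names yourself, in the same column order as the training data. The helper coefficients_by_feature, the names, and the numbers are all made up for illustration.

```python
def coefficients_by_feature(weights, intercept, feature_names):
    """Pair each weight with its feature name (hypothetical helper, not part of MLlib)."""
    weights = [float(w) for w in weights]
    if len(weights) != len(feature_names):
        raise ValueError("expected one name per weight")
    coeffs = {"(intercept)": float(intercept)}
    coeffs.update(zip(feature_names, weights))
    return coeffs

# Illustrative values only; with a trained model you would pass
# model.weights.toArray() and model.intercept instead.
example = coefficients_by_feature([0.5, -1.25, 2.0], 0.1, ["age", "income", "score"])
for name, value in example.items():
    print(name, value)
```

MLlib's LabeledPoint carries no column names, so the name list has to come from wherever the data was prepared.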

Recall that hθ(x) = 1 / (1 + exp(-(θ0 + θ1 * x1 + ... + θn * xn))), where θ0 is the intercept, [θ1, ..., θn] are the weights, and n is the number of features.
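As a quick check, the probability can be recomputed by hand from the intercept and weights using the logistic function 1 / (1 + exp(-margin)). This is a minimal sketch in plain Python; the numbers are invented and predict_probability is a hypothetical helper, not part of MLlib.

```python
from math import exp

def predict_probability(weights, intercept, features):
    """Return 1 / (1 + exp(-(intercept + sum(w_i * x_i))))."""
    margin = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + exp(-margin))

# Made-up coefficients; in practice these would come from
# model.weights and model.intercept.
weights = [0.5, -1.0]
intercept = 0.25
p = predict_probability(weights, intercept, [1.0, 0.75])
print(p)  # margin is 0.25 + 0.5 - 0.75 = 0.0, so p is 0.5
```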

Edit

This is how the prediction is computed; you can check it in LogisticRegressionModel's source:

def predict(self, x):
    """
    Predict values for a single data point or an RDD of points
    using the model trained.
    """
    if isinstance(x, RDD):
        return x.map(lambda v: self.predict(v))

    x = _convert_to_vector(x)
    if self.numClasses == 2:
        margin = self.weights.dot(x) + self._intercept
        if margin > 0:
            prob = 1 / (1 + exp(-margin))
        else:
            exp_margin = exp(margin)
            prob = exp_margin / (1 + exp_margin)
        if self._threshold is None:
            return prob
        else:
            return 1 if prob > self._threshold else 0
    else:
        best_class = 0
        max_margin = 0.0
        if x.size + 1 == self._dataWithBiasSize:
            for i in range(0, self._numClasses - 1):
                margin = x.dot(self._weightsMatrix[i][0:x.size]) + \
                    self._weightsMatrix[i][x.size]
                if margin > max_margin:
                    max_margin = margin
                    best_class = i + 1
        else:
            for i in range(0, self._numClasses - 1):
                margin = x.dot(self._weightsMatrix[i])
                if margin > max_margin:
                    max_margin = margin
                    best_class = i + 1
        return best_class
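The two branches in the binary case compute the same sigmoid. Splitting on the sign of the margin keeps exp from ever being called on a large positive argument, which would overflow. The sketch below illustrates this in plain Python, assuming exp behaves like math.exp; stable_sigmoid and naive_sigmoid are names made up here.

```python
from math import exp

def naive_sigmoid(margin):
    # Direct formula; exp(-margin) overflows for very negative margins.
    return 1.0 / (1.0 + exp(-margin))

def stable_sigmoid(margin):
    # Same branching as the binary case of predict() above:
    # exp is only ever called on a non-positive argument.
    if margin > 0:
        return 1.0 / (1.0 + exp(-margin))
    exp_margin = exp(margin)
    return exp_margin / (1.0 + exp_margin)

print(stable_sigmoid(-1000.0))  # exp(-1000.0) underflows to 0.0, so this is 0.0
try:
    naive_sigmoid(-1000.0)
except OverflowError as err:
    print("naive version failed:", err)
```

Since predict returns prob when self._threshold is None, calling model.clearThreshold() before model.predict should give you this probability instead of a 0/1 label.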