pyspark, logistic regression, how to get coefficient of respective features

I am new to Spark; my current version is 1.3.1. I want to implement logistic regression with PySpark, and I found this example in the Spark Python MLlib documentation:

from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint
from numpy import array

# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("data/mllib/sample_svm_data.txt")
parsedData = data.map(parsePoint)

# Build the model
model = LogisticRegressionWithLBFGS.train(parsedData)

# Evaluating the model on training data
labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
# Index into the (label, prediction) pair so the lambda works on Python 2 and 3
trainErr = labelsAndPreds.filter(lambda vp: vp[0] != vp[1]).count() / float(parsedData.count())
print("Training Error = " + str(trainErr))

I found that the model has the following attributes:

In [21]: model.<TAB>
model.clearThreshold  model.predict         model.weights
model.intercept       model.setThreshold  

How can I get the coefficients of logistic regression?


  • As you noticed, the coefficients are available through the attributes of LogisticRegressionModel:


    weights – Weights computed for every feature.

    intercept – Intercept computed for this model. (Only used in Binary Logistic Regression. In Multinomial Logistic Regression, the intercepts will not be a single value, so the intercepts will be part of the weights.)

    numFeatures – the dimension of the features.

    numClasses – the number of possible outcomes in a k-class classification problem in Multinomial Logistic Regression. By default, it is binary logistic regression, so numClasses will be set to 2.
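
To line the weights up with their features, one option is to pair them with the feature names in the same order used to build each LabeledPoint. The following is a minimal sketch only: the feature names and numbers are hypothetical, and plain Python values stand in for `model.weights` and `model.intercept` so that it runs without a Spark context.

```python
# Hypothetical stand-ins for a trained model's attributes. In PySpark these
# would come from model.weights (a vector) and model.intercept (a float).
weights = [0.8, -1.2, 0.05]
intercept = -0.3

# Assumed feature names, in the same order as the values passed to LabeledPoint.
feature_names = ["age", "income", "visits"]

# Pair each coefficient with the feature it multiplies.
coefficients = dict(zip(feature_names, weights))

for name in feature_names:
    print("%s: %s" % (name, coefficients[name]))
print("intercept: %s" % intercept)
```

The pairing is only meaningful if the name order matches the column order of the training data, since the weights vector itself carries no feature names.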

    Don't forget that hθ(x) = 1 / (1 + exp(-(θ0 + θ1 * x1 + ... + θn * xn))), where θ0 represents the intercept, [θ1,...,θn] the weights, and n is the number of features.
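
As a rough illustration of how the intercept and weights combine into a probability, the sketch below evaluates the logistic function by hand. The helper name `logistic_probability` and all numbers are assumptions made for this example; they are not part of the MLlib API and do not come from a trained model.

```python
from math import exp

def logistic_probability(weights, intercept, x):
    """Return 1 / (1 + exp(-(intercept + sum(w_i * x_i))))."""
    margin = intercept + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + exp(-margin))

# Hypothetical values, not taken from a trained model.
weights = [0.5, -0.25]
intercept = 0.1
x = [2.0, 4.0]

# margin = 0.1 + 0.5 * 2.0 - 0.25 * 4.0 = 0.1
p = logistic_probability(weights, intercept, x)
print(p)
```

With a real model, the same arithmetic applied to `model.weights` and `model.intercept` should match what `model.predict` returns once the threshold has been cleared with `model.clearThreshold()`.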


    This is how the prediction is computed; you can check it in the source of LogisticRegressionModel:

    def predict(self, x):
        """
        Predict values for a single data point or an RDD of points
        using the model trained.
        """
        if isinstance(x, RDD):
            return x.map(lambda v: self.predict(v))

        x = _convert_to_vector(x)
        if self.numClasses == 2:
            margin = self.weights.dot(x) + self._intercept
            # Both branches compute the same sigmoid without overflowing exp()
            if margin > 0:
                prob = 1 / (1 + exp(-margin))
            else:
                exp_margin = exp(margin)
                prob = exp_margin / (1 + exp_margin)
            if self._threshold is None:
                return prob
            else:
                return 1 if prob > self._threshold else 0
        else:
            best_class = 0
            max_margin = 0.0
            if x.size + 1 == self._dataWithBiasSize:
                # Each row of weights carries its intercept as the last entry
                for i in range(0, self._numClasses - 1):
                    margin = x.dot(self._weightsMatrix[i][0:x.size]) + \
                        self._weightsMatrix[i][x.size]
                    if margin > max_margin:
                        max_margin = margin
                        best_class = i + 1
            else:
                for i in range(0, self._numClasses - 1):
                    margin = x.dot(self._weightsMatrix[i])
                    if margin > max_margin:
                        max_margin = margin
                        best_class = i + 1
            return best_class
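
The two branches on the sign of the margin in the binary case are there for numerical stability. The standalone sketch below shows the idea under stated assumptions: the function names are hypothetical, and the default threshold of 0.5 is chosen for this example rather than read from a model.

```python
from math import exp

def naive_sigmoid(margin):
    # Direct form of the formula; exp(-margin) overflows for very negative margins.
    return 1.0 / (1.0 + exp(-margin))

def stable_sigmoid(margin):
    # Only ever exponentiate a non-positive number, so exp() cannot overflow.
    if margin > 0:
        return 1.0 / (1.0 + exp(-margin))
    exp_margin = exp(margin)
    return exp_margin / (1.0 + exp_margin)

def classify(margin, threshold=0.5):
    # Mirrors the threshold step: None returns the raw probability.
    prob = stable_sigmoid(margin)
    if threshold is None:
        return prob
    return 1 if prob > threshold else 0

# The naive form fails on a large negative margin; the stable form does not.
try:
    naive_sigmoid(-1000.0)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

print(naive_overflowed, stable_sigmoid(-1000.0), classify(2.0))
```

Passing `threshold=None` here plays the same role as calling `model.clearThreshold()` on the model: the probability is returned instead of a 0/1 label.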