
PySpark mllib p-values for logistic regression


I am currently running a logistic regression in PySpark using the ML-Lib package (Spark Version 2.1). In order to make sense of the coefficients and check their statistical significance, I would like to investigate the corresponding p-values.

Is there any way to get the p-values using the ML-Lib package?


Solution

  • You can use GeneralizedLinearRegression from the pyspark.ml package to obtain p-values for a logistic regression:

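    from pyspark.ml.regression import GeneralizedLinearRegression

    # `dataset` is a DataFrame with a "features" vector column and a 0/1 "label" column.
    # family="binomial" with link="logit" is logistic regression; regParam=0.0 keeps the fit unregularized.
    glr = GeneralizedLinearRegression(family="binomial", link="logit",
                                      maxIter=10, regParam=0.0)
    model = glr.fit(dataset)
    summary = model.summary

    # When an intercept is fitted, its statistic is the last element of each array.
    print("Coefficient Standard Errors: " + str(summary.coefficientStandardErrors))
    print("T Values: " + str(summary.tValues))
    print("P Values: " + str(summary.pValues))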
    

    You can find a detailed explanation here: https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#generalized-linear-regression

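    Please keep in mind that standard errors (and thus p-values) can only be computed when the matrix that has to be inverted is actually invertible, so check the eigenvalues of the matrix built from your DataFrame. If it is singular, for example because of collinear features, the package will raise an error instead of returning these statistics.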
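
The invertibility caveat can be checked on a small sample before fitting. What follows is a minimal sketch, assuming the sample fits in memory as a plain list of rows (for instance collected from the DataFrame) and already contains a column of ones for the intercept. `matrix_rank` is a hypothetical helper written here in plain Python for illustration; it is not part of Spark. If the rank is lower than the number of columns, the columns are linearly dependent and standard errors cannot be computed.

```python
def matrix_rank(rows, tol=1e-10):
    """Rank of a small dense matrix via Gaussian elimination with partial pivoting."""
    m = [[float(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the row at or below `rank` with the largest entry in this column.
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot: this column depends on earlier ones
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate the column from every row below the pivot row.
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


# Hypothetical sample: columns are intercept, x, and 2 * x (perfectly collinear).
collinear = [[1, 2, 4], [1, 3, 6], [1, 5, 10], [1, 7, 14]]
# Same layout, but the third column is x squared, so no column is redundant.
independent = [[1, 2, 4], [1, 3, 9], [1, 5, 25], [1, 7, 49]]

for name, sample in [("collinear", collinear), ("independent", independent)]:
    rank = matrix_rank(sample)
    full = rank == len(sample[0])
    print(name, "rank:", rank, "full column rank:", full)
```

A rank-deficient sample suggests dropping or combining the redundant features before calling `fit`.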
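
To make sense of the printed arrays, it can help to see how a p-value relates to a coefficient and its standard error. The sketch below assumes the conventional Wald test for a binomial model: the statistic is the estimate divided by its standard error, compared against a standard normal distribution. The feature names and numbers are hypothetical placeholders, not output from a real model, and `wald_p_value` and `summarize` are illustrative helpers rather than Spark functions.

```python
import math


def wald_p_value(estimate, std_error):
    """Two-sided p-value for H0: estimate == 0 under a standard normal reference."""
    z = estimate / std_error
    # 2 * (1 - Phi(|z|)), written with erfc for accuracy in the tails.
    return math.erfc(abs(z) / math.sqrt(2.0))


def summarize(names, estimates, std_errors):
    """Return (name, estimate, z, p) rows, one per coefficient."""
    rows = []
    for name, est, se in zip(names, estimates, std_errors):
        rows.append((name, est, est / se, wald_p_value(est, se)))
    return rows


# Hypothetical values standing in for the model's coefficients and intercept
# and for summary.coefficientStandardErrors (intercept placed last).
names = ["age", "income", "(intercept)"]
estimates = [0.98, -0.10, 1.20]
std_errors = [0.50, 0.25, 0.40]

for name, est, z, p in summarize(names, estimates, std_errors):
    print(f"{name:12s} estimate={est:+.3f} z={z:+.2f} p={p:.4f}")
```

Pairing each statistic with a feature name this way makes it easier to see which coefficients are significant at a chosen level.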