
Scala (NOT pyspark) map linear regression coefficients to feature names (categorical and continuous)


I have a DataFrame in Scala that looks like this:

df.show
+---+-----+-------------------+--------+------------------+--------+------+------------+-------------+
| id|group|  normalized_amount|query_id|                 y|      y1|group1|groupIndexed| groupEncoded|
+---+-----+-------------------+--------+------------------+--------+------+------------+-------------+
|  1|    B|   0.22874172014806|       1| 0.317739988492575|       0|     B|         1.0|(2,[1],[1.0])|
|  2|    A|  -1.42432215217563|       2| -1.32008967486074|       0|     C|         0.0|(2,[0],[1.0])|
|  3|    B|  -2.03644548423379|       3| -1.65740392834359|       0|     B|         1.0|(2,[1],[1.0])|
|  4|    B|  0.425753803902096|       4|-0.127591370989296|       0|     C|         0.0|(2,[0],[1.0])|
|  5|    A|  0.521050829955076|       5| 0.824285664580579|       1|     A|         2.0|    (2,[],[])|
|  6|    A|-0.0416682439998418|       6| 0.321350404322885|       1|     C|         0.0|(2,[0],[1.0])|
|  7|    A|   -1.2787327462978|       7| -0.88099379032367|       0|     A|         2.0|    (2,[],[])|
|  8|    A|  0.431780409975322|       8| 0.575249966796747|       1|     C|         0.0|(2,[0],[1.0])|

And I'm performing a linear regression of y on group1 (a categorical variable with 3 categories) and normalized_amount (a continuous variable), as follows:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

val assembler = new VectorAssembler().setInputCols(Array("groupEncoded", "normalized_amount")).setOutputCol("features")
val dfFeatures = assembler.transform(df)
val lr = new LinearRegression().setLabelCol("y") // the default label column is "label"
val lrModel = lr.fit(dfFeatures)
val lrPrediction = lrModel.transform(dfFeatures)

I can access the coefficients and standard errors as follows:

lrModel.intercept
lrModel.coefficients // model coefficient estimates (not intercept)
lrModel.summary.coefficientStandardErrors // standard errors of intercept and coefficients, not sure in which order

My questions are:

  1. How can I figure out which feature corresponds to which coefficient estimate (for categorical variables, I need to figure out the coefficient of each category)? Same for the standard errors?
  2. How can I choose which category to "leave out" as the reference category?
  3. How do I perform a linear regression with no intercept?

I've seen some answers to similar questions, but they are all in PySpark and not in Scala, and I'm only using Scala.


Solution

  • With a DataFrame as your transformed df (the one that includes the prediction) and a LinearRegressionModel, you can access the attributes of the VectorAssembler output column. This code is from Databricks; I slightly adapted it for a LinearRegressionModel instead of a Pipeline. Note that you can choose whether you want intercept estimation or not:

    import org.apache.spark.ml.attribute.AttributeGroup
    import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
    import org.apache.spark.sql.DataFrame

    val lrToFit: LinearRegression = ???
    lrToFit.setFitIntercept(false)
    
    // With this dataframe as your transformed df that includes the prediction
    val df: DataFrame = ???
    val lr: LinearRegressionModel = ???
    val schema = df.schema
    
    // Using the schema, the attributes of the VectorAssembler output (features) can be extracted
    val features = AttributeGroup.fromStructField(schema(lr.getFeaturesCol)).attributes.get.map(_.name.get)
    val featureNames: Array[String] = if (lr.getFitIntercept) {
      Array("(Intercept)") ++ features
    } else {
      features
    }
    
    val coefficients = lr.coefficients.toArray
    val coeffs = if (lr.getFitIntercept) {
      // the intercept goes first so it lines up with "(Intercept)" in featureNames
      Array(lr.intercept) ++ coefficients
    } else {
      coefficients
    }
    
    featureNames.zip(coeffs).foreach { case (feature, coeff) =>
      println(s"$feature\t$coeff")
    }
    

    This method is especially useful if you load a pretrained model, because in that case you might not know the order of the features in the VectorAssembler transformation. I think you will need to select the reference category manually.
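
  • Regarding the reference category: OneHotEncoder, with its default setDropLast(true), drops the category with the highest StringIndexer index, and that dropped category acts as the regression's reference level. You can therefore steer which category is left out through StringIndexer's setStringOrderType. A sketch, assuming Spark 3.x and the group1 column from the question:

    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    // "alphabetAsc" maps A -> 0.0, B -> 1.0, C -> 2.0, so C (the highest
    // index) is dropped by the encoder and becomes the reference category
    val indexer = new StringIndexer()
      .setInputCol("group1")
      .setOutputCol("groupIndexed")
      .setStringOrderType("alphabetAsc")

    val encoder = new OneHotEncoder()
      .setInputCols(Array("groupIndexed"))
      .setOutputCols(Array("groupEncoded"))
      .setDropLast(true) // the default: the last index becomes the reference

    val indexed = indexer.fit(df).transform(df)
    val encoded = encoder.fit(indexed).transform(indexed)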