I have a DataFrame in Scala that looks like this:
df.show
+---+-----+-------------------+--------+------------------+--------+------+------------+-------------+
| id|group| normalized_amount|query_id| y| y1|group1|groupIndexed| groupEncoded|
+---+-----+-------------------+--------+------------------+--------+------+------------+-------------+
| 1| B| 0.22874172014806| 1| 0.317739988492575| 0| B| 1.0|(2,[1],[1.0])|
| 2| A| -1.42432215217563| 2| -1.32008967486074| 0| C| 0.0|(2,[0],[1.0])|
| 3| B| -2.03644548423379| 3| -1.65740392834359| 0| B| 1.0|(2,[1],[1.0])|
| 4| B| 0.425753803902096| 4|-0.127591370989296| 0| C| 0.0|(2,[0],[1.0])|
| 5| A| 0.521050829955076| 5| 0.824285664580579| 1| A| 2.0| (2,[],[])|
| 6| A|-0.0416682439998418| 6| 0.321350404322885| 1| C| 0.0|(2,[0],[1.0])|
| 7| A| -1.2787327462978| 7| -0.88099379032367| 0| A| 2.0| (2,[],[])|
| 8| A| 0.431780409975322| 8| 0.575249966796747| 1| C| 0.0|(2,[0],[1.0])|
I'm performing a linear regression of y on group1 (a categorical variable with 3 categories) and normalized_amount (a continuous variable) as follows:
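import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
val assembler = new VectorAssembler().setInputCols(Array("groupEncoded", "normalized_amount")).setOutputCol("features")
val dfFeatures = assembler.transform(df)
val lr = new LinearRegression().setLabelCol("y") // the label column defaults to "label", so point it at y
val lrModel = lr.fit(dfFeatures)
val lrPrediction = lrModel.transform(dfFeatures)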
I can access the coefficients and standard errors as follows:
lrModel.intercept
lrModel.coefficients // model coefficient estimates (intercept not included)
lrModel.summary.coefficientStandardErrors // standard errors of the intercept and coefficients, not sure in which order
My questions are
I've seen some answers to similar questions, but they are all in PySpark rather than Scala, and I'm only using Scala.
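Given your transformed DataFrame (the one that includes the prediction) and a LogisticRegressionModel, you can read the attributes of the VectorAssembler output column. This code comes from Databricks; I slightly adapted it to work with a LogisticRegressionModel instead of a Pipeline. The members it uses (getFeaturesCol, getFitIntercept, coefficients and intercept) are also available on LinearRegressionModel, which is what you have. Note that you can choose whether or not to estimate an intercept: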
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.DataFrame

val lrToFit: LinearRegression = ???
lrToFit.setFitIntercept(false)
// With this DataFrame as your transformed df that includes the prediction
val df: DataFrame = ???
val lr: LogisticRegressionModel = ???
val schema = df.schema
// Using the schema, the attributes of the VectorAssembler output (features) can be extracted
val features = AttributeGroup.fromStructField(schema(lr.getFeaturesCol)).attributes.get.map(_.name.get)
val featureNames: Array[String] = if (lr.getFitIntercept) {
  Array("(Intercept)") ++ features
} else {
  features
}
val coefficients = lr.coefficients.toArray
// The intercept goes first so that it lines up with "(Intercept)" in featureNames
val coeffs = if (lr.getFitIntercept) {
  Array(lr.intercept) ++ coefficients
} else {
  coefficients
}
featureNames.zip(coeffs).foreach { case (feature, coeff) =>
  println(s"$feature\t$coeff")
}
This approach is useful when you load a pretrained model, because in that case you might not know the order of the features in the VectorAssembler output. I think you will need to select the reference category manually.
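For the standard errors, the Spark documentation for coefficientStandardErrors states that when fitIntercept is true, the last element corresponds to the intercept. The following is a minimal sketch of lining the three arrays up under that assumption. It is plain Scala with no Spark dependency; the object CoefficientTable, the case class Term and the function coefficientTable are hypothetical names introduced here, and the numbers in main are made up purely to show the alignment.

```scala
// Hypothetical helper (not part of Spark): pairs each term name with its
// estimate and standard error.
object CoefficientTable {
  final case class Term(name: String, estimate: Double, stdError: Double)

  def coefficientTable(
      featureNames: Array[String],
      coefficients: Array[Double],
      intercept: Double,
      stdErrors: Array[Double],
      fitIntercept: Boolean): Seq[Term] = {
    // Assumption: with an intercept, its standard error is the LAST element
    // of stdErrors, so the intercept is appended last here as well.
    val names = if (fitIntercept) featureNames :+ "(Intercept)" else featureNames
    val estimates = if (fitIntercept) coefficients :+ intercept else coefficients
    require(
      names.length == estimates.length && estimates.length == stdErrors.length,
      "names, estimates and standard errors must have the same length")
    names.indices.map(i => Term(names(i), estimates(i), stdErrors(i)))
  }

  def main(args: Array[String]): Unit = {
    // Placeholder names and made-up values, not results from the question's data.
    val table = coefficientTable(
      featureNames = Array("group_dummy_0", "group_dummy_1", "normalized_amount"),
      coefficients = Array(0.1, -0.2, 0.9),
      intercept = 0.05,
      stdErrors = Array(0.3, 0.4, 0.1, 0.2),
      fitIntercept = true)
    table.foreach(t => println(s"${t.name}\t${t.estimate}\t${t.stdError}"))
  }
}
```

With a fitted model, the inputs would presumably be the names extracted via AttributeGroup as above, lrModel.coefficients.toArray, lrModel.intercept, lrModel.summary.coefficientStandardErrors and lrModel.getFitIntercept. Note that the standard errors are only available when the "normal" solver is used.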