Does anyone know what the default reference group is in a PySpark multinomial logistic regression? For instance, suppose we have a multiclass outcome/target with classes A, B, C, and D.

How does Spark choose the reference category? In standard logistic regression in other software (e.g. R, SAS) you can set the reference group yourself. So if your reference is A, you get n-1 models fitted together, with the target classes modeled as A vs B, A vs C, and A vs D.
You want to control this process because if a class with only a few observations (a small sample) is set as the reference, the estimates will be unstable.
Here is the link to the multinomial logistic regression model in the PySpark docs. There the outcome classes are 0, 1, and 2, but there is no clarity on what the reference is. I am assuming it may be 0, but I am not sure of that.
I believe that, by default, it does not use a reference group at all. This is why, if you run the snippet from your link, you find non-zero values for all of the intercepts.
From the scala source: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala
> Note that there is a difference between multinomial (softmax) and binary loss. The binary case uses one outcome class as a "pivot" and regresses the other class against the pivot. In the multinomial case, the softmax loss function is used to model each class probability independently. Using softmax loss produces `K` sets of coefficients, while using a pivot class produces `K - 1` sets of coefficients (a single coefficient vector in the binary case). In the binary case, we can say that the coefficients are shared between the positive and negative classes...
It goes on to explain that the coefficients are not generally identifiable (which is why one would normally pick a reference label), but that once regularization is applied the coefficients do become identifiable.
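The identifiability point can be checked directly: in the softmax parameterization, adding the same per-sample constant to every class's logit leaves all predicted probabilities unchanged, so without regularization a whole family of coefficient matrices gives the same model. Subtracting one class's column is exactly the "pivot" trick. A plain-numpy sketch (not Spark code; all names here are my own):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))   # 5 samples, 2 features
W = rng.normal(size=(2, 3))   # one coefficient column per class (K = 3)
b = rng.normal(size=3)        # one intercept per class

p1 = softmax(X @ W + b)

# Shift every class's coefficients and intercept by the same amount:
# the softmax probabilities do not change, so W is not identifiable.
shift_w = rng.normal(size=(2, 1))
p2 = softmax(X @ (W + shift_w) + (b + 0.7))
assert np.allclose(p1, p2)

# Subtracting class 0's column is the usual "pivot" trick: class 0's
# parameters become zero, leaving K - 1 free coefficient vectors with
# class 0 as the reference, yet the fitted probabilities are identical.
W_pivot = W - W[:, [0]]
b_pivot = b - b[0]
p3 = softmax(X @ W_pivot + b_pivot)
assert np.allclose(p1, p3)
```

Adding an L2 penalty breaks this invariance (the shifted solutions no longer have equal norm), which is why regularization makes the coefficients identifiable without choosing a reference class.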