Tags: apache-spark, pyspark, logistic-regression, multinomial, multiclass-classification

Reference group in PySpark multinomial regression


Does anyone know what the default reference group is in a PySpark multinomial logistic regression? For instance, suppose we have a multiclass outcome/target with classes A, B, C, and D.

How does Spark choose the reference category? In standard logistic regression in other software (e.g. R, SAS) you can set the reference group yourself. So if your reference is A, you get n - 1 models fitted together, with the target modeled as A vs B, A vs C, and A vs D.

You want to control this choice because if an outcome class with only a small number of observations is set as the reference, the estimates will be unstable.

Here is the link to the multinomial logistic regression model in PySpark. There the outcome classes are 0, 1, and 2, but there is no indication of what the reference is. I assume it may be 0, but I am not sure.
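
To see what the fitted model actually exposes, here is a minimal sketch on a made-up toy DataFrame (not the example from the linked page). It fits a model with family="multinomial" and prints the per-class parameters via the standard coefficientMatrix and interceptVector attributes:

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()

    # Toy data with three outcome classes (0, 1, 2); the feature values are arbitrary.
    df = spark.createDataFrame([
        (0.0, Vectors.dense(1.0, 0.1)),
        (1.0, Vectors.dense(0.2, 1.3)),
        (2.0, Vectors.dense(1.1, 2.0)),
        (0.0, Vectors.dense(0.9, 0.2)),
        (1.0, Vectors.dense(0.1, 1.1)),
        (2.0, Vectors.dense(1.2, 1.9)),
    ], ["label", "features"])

    lr = LogisticRegression(family="multinomial", regParam=0.1)
    model = lr.fit(df)

    # One coefficient row and one intercept per class: K sets, not K - 1,
    # so no class is visibly dropped as a reference.
    print(model.coefficientMatrix)   # 3 x 2 matrix
    print(model.interceptVector)     # 3 intercepts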


Solution

  • I believe that, by default, it does not use a reference group. This is why, if you run the snippet from your link, you find non-zero values for all of the intercepts.

    From the Scala source: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala

      Note that there is a difference between multinomial (softmax) and binary loss. The binary case uses one outcome class as a "pivot" and regresses the other class against the pivot. In the multinomial case, the softmax loss function is used to model each class probability independently. Using softmax loss produces K sets of coefficients, while using a pivot class produces K - 1 sets of coefficients (a single coefficient vector in the binary case). In the binary case, we can say that the coefficients are shared between the positive and negative classes...

    It goes on to explain that the coefficients are not generally identifiable under this parameterization (which is why one would normally pick a reference label), but that once regularization is applied the coefficients do become identifiable.
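
    If you still want reference-group style contrasts for interpretation, one post-hoc option (a sketch, not something Spark does for you) is to subtract the chosen reference class's parameters from every class. The softmax probabilities are unchanged by this shift, and the remaining rows then read as log-odds of each class versus the reference. This assumes model is a fitted multinomial LogisticRegressionModel such as the one produced by the snippet in the question:

        import numpy as np

        # Pull the fitted parameters out of Spark's matrix/vector types.
        coefs = model.coefficientMatrix.toArray()      # shape (K, num_features)
        intercepts = model.interceptVector.toArray()   # shape (K,)

        ref = 0  # choose class 0 as the reference/pivot

        # Softmax is invariant to subtracting one class's parameters from all
        # classes, so this is a pure reparameterization, not a refit.
        coefs_vs_ref = coefs - coefs[ref]
        intercepts_vs_ref = intercepts - intercepts[ref]

        # Row ref is now all zeros; row k gives the log-odds of class k vs the reference.
        print(coefs_vs_ref)
        print(intercepts_vs_ref)

    Note that because the regularization was applied to the K-set parameterization, these re-centered values will not exactly match what a K - 1 pivot fit (as in R or SAS) would estimate, but the predicted probabilities are identical.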