
Unexpected coefficients from Spark multinomial Logistic Regression


I'm running Spark 2.1.1 on my Mac, macOS Sierra (in case that's relevant). I tried to fit a multinomial logistic regression on a test dataset I found online; I report the first few lines here (I don't know how to attach the file here):

1,0,24
1,0,26
1,0,26
1,1,27
1,1,27
3,1,27

The first column is the label ('brand', with values 1, 2, 3); the second and third columns are the features ('sex' and 'age').

Since the label has 3 classes, the multinomial logistic regression should fit 3 binomial models and then take the prediction from the one that maximizes the probability of membership in its class. So I expect the model to return a 3x2 coefficientMatrix: 3 because there are 3 classes, and 2 because there are 2 features. This documentation seems consistent with this point of view.

But, surprise surprise...

>>> logit_model.coefficientMatrix
DenseMatrix(4, 2, [-1.2781, -2.8523, 0.0961, 0.5994, 0.6199, 0.9676, 0.5621, 1.2853], 1)
>>> logit_model.interceptVector
DenseVector([-4.5912, 13.0291, 1.2544, -9.6923])

The coefficientMatrix is 4x2, and I have 4 intercepts rather than 3. Even stranger is this:

>>> logit_model.numClasses
4

For some strange reason, the model "sees" 4 classes, even though I have just 3 (see the code below for a check on this).

Any suggestions? Thank you very much.


Here is the full code:

from pyspark.sql import functions as f
from pyspark.sql import types as t
from pyspark.ml import classification as cl
from pyspark.ml import feature as feat

customSchema = t.StructType([
    t.StructField('brand', t.IntegerType(), True),
    t.StructField('sex', t.IntegerType(), True),
    t.StructField('age', t.IntegerType(), True),
])

test_df01 = (
    spark
    .read
    .format('csv')
    .options(delimiter=',', header=False)
    .load('/Users/vanni/Downloads/mlogit_test.csv', schema=customSchema)
)

va = (
    feat.VectorAssembler()
    .setInputCols(['sex', 'age'])
    .setOutputCol('features')
)
test_df03 = (
    va
    .transform(test_df01)
    .drop('sex')
    .drop('age')
    .withColumnRenamed('brand', 'label')
)

logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setStandardization(False)
    .setThresholds([.5, .5, .5]) # to be adjusted after I know the actual values
    .setThreshold(None)
    .setMaxIter(100) # default
    .setRegParam(0.0) # default
    .setElasticNetParam(0.0) # default
    .setTol(1e-6) # default
)

logit_model = logit_abst.fit(test_df03)

Here is the check that there are just 3 classes:

>>> test_df03.select('label').distinct().orderBy('label').show()
+-----+
|label|
+-----+
|    1|
|    2|
|    3|
+-----+

Solution

  • There is nothing strange going on here. Spark assumes that labels are consecutive integer values, represented as DoubleType, starting at 0.

    Since the largest label you get is 3, Spark assumes the labels are actually 0, 1, 2, 3, even though 0 never occurs in the dataset.

    If this behavior is undesired, re-encode the labels to be zero-based, or apply StringIndexer to the raw labels; a sketch of both options follows.
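
    A minimal sketch of both options, reusing the imports and variables from the question; the names test_df04, indexed_df, and label_idx are illustrative, not part of the original code:

    from pyspark.sql import functions as f
    from pyspark.ml import feature as feat

    # Option 1: shift the 1-based labels down so they become 0.0, 1.0, 2.0.
    test_df04 = test_df03.withColumn(
        'label', (f.col('label') - 1).cast('double')
    )

    # Option 2: let StringIndexer assign zero-based indices to the raw labels.
    # Note that by default StringIndexer orders labels by frequency, so index
    # 0.0 goes to the most frequent brand, not necessarily to brand 1.
    indexer = feat.StringIndexer(inputCol='label', outputCol='label_idx')
    indexed_df = (
        indexer
        .fit(test_df03)
        .transform(test_df03)
        .drop('label')
        .withColumnRenamed('label_idx', 'label')
    )

    # Refitting on either DataFrame should now report 3 classes and a
    # 3x2 coefficient matrix, matching the expectation in the question.
    logit_model = logit_abst.fit(test_df04)
    print(logit_model.numClasses)         # 3
    print(logit_model.coefficientMatrix)  # DenseMatrix(3, 2, ...)

    With zero-based labels the interceptVector should also come back with length 3, and the thresholds list of length 3 set on the estimator no longer conflicts with the inferred number of classes.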