I run Spark 2.1.1 on my Mac with macOS Sierra (in case that is relevant). I tried to fit a multinomial logistic regression on a test dataset I found online. Here are the first few lines (I don't know how to attach the file here):
1,0,24
1,0,26
1,0,26
1,1,27
1,1,27
3,1,27
The first column is the label ('brand', values: 1, 2, 3), the second and third columns are the features ('sex' and 'age').
Since the label has 3 classes, the multinomial logistic regression should fit 3 binomial models and then choose the prediction from the one that maximizes the probability of being in that class. So I expect the model to return a 3x2 coefficientMatrix: 3 because there are 3 classes, and 2 because there are 2 features. This documentation seems to be consistent with this point of view.
But, surprise surprise...
>>> logit_model.coefficientMatrix
DenseMatrix(4, 2, [-1.2781, -2.8523, 0.0961, 0.5994, 0.6199, 0.9676, 0.5621, 1.2853], 1)
>>> logit_model.interceptVector
DenseVector([-4.5912, 13.0291, 1.2544, -9.6923])
The coefficientMatrix is 4x2, and I have 4 intercepts rather than 3. Even stranger is this:
>>> logit_model.numClasses
4
For some strange reason, the model "sees" 4 classes, even though I have just 3 (see the code below for a check on this).
Any suggestions? Thank you very much.
Here is the full code:
from pyspark.sql import functions as f
from pyspark.sql import types as t
from pyspark.ml import classification as cl
from pyspark.ml import feature as feat
customSchema = t.StructType(
    [t.StructField('brand', t.IntegerType(), True),
     t.StructField('sex', t.IntegerType(), True),
     t.StructField('age', t.IntegerType(), True)]
)
test_df01 = (
    spark
    .read
    .format('csv')
    .options(delimiter=',', header=False)
    .load('/Users/vanni/Downloads/mlogit_test.csv', schema=customSchema)
)
va = (
    feat.VectorAssembler()
    .setInputCols(['sex', 'age'])
    .setOutputCol('features')
)
test_df03 = (
    va
    .transform(test_df01)
    .drop('sex')
    .drop('age')
    .withColumnRenamed('brand', 'label')
)
logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setStandardization(False)
    .setThresholds([.5, .5, .5])  # to be adjusted after I know the actual values
    .setThreshold(None)
    .setMaxIter(100)  # default
    .setRegParam(0.0)  # default
    .setElasticNetParam(0.0)  # default
    .setTol(1e-6)  # default
)
logit_model = logit_abst.fit(test_df03)
Here is the check that the classes are just 3:
>>> test_df03.select('label').distinct().orderBy('label').show()
+-----+
|label|
+-----+
|    1|
|    2|
|    3|
+-----+
There is nothing strange going on here. Spark assumes that labels are consecutive integer values, represented as DoubleType, and starting with 0.
Since the largest label you get is 3, Spark assumes that labels are actually 0, 1, 2, 3 - even if 0 never occurs in the dataset.
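To make the rule above concrete, here is a minimal pure-Python sketch of how the class count follows from the largest label. It is an illustration of the behavior described in this answer, not Spark's actual code, and `infer_num_classes` is a hypothetical helper name.

```python
def infer_num_classes(labels):
    """Return the class count implied by zero-based, consecutive labels.

    Illustrative only: mirrors the rule "largest label + 1" described above.
    """
    max_label = max(labels)
    if max_label < 0 or int(max_label) != max_label:
        raise ValueError("labels must be non-negative integers")
    # Labels are assumed to be 0, 1, ..., max_label, whether or not
    # every value actually occurs in the data.
    return int(max_label) + 1


raw_labels = [1, 2, 3]          # distinct labels from the question
zero_based = [0, 1, 2]          # the same classes, shifted to start at 0

print(infer_num_classes(raw_labels))   # 4
print(infer_num_classes(zero_based))   # 3
```

With labels 1, 2, 3 the implied range is 0..3, which matches the 4 rows of the coefficientMatrix and the 4 intercepts reported in the question.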
If this behavior is undesired, you should re-encode the labels to be zero-based, or apply StringIndexer on the raw labels.
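A rough sketch of the re-encoding option follows. The helper names `zero_based_mapping` and `reencode_label_column` are made up for illustration, and the Spark part assumes a DataFrame shaped like `test_df03` from the question with an integer `label` column. pyspark is imported inside the function only so that the mapping helper can be tried without a Spark installation.

```python
def zero_based_mapping(labels):
    """Map each distinct raw label to a consecutive index starting at 0."""
    return {raw: idx for idx, raw in enumerate(sorted(set(labels)))}


def reencode_label_column(df, mapping, label_col="label"):
    """Replace raw labels in a Spark DataFrame column using `mapping`.

    Hypothetical helper; assumes every label in `df` appears in `mapping`
    (unmapped labels would become null).
    """
    from pyspark.sql import functions as f  # lazy import, needs pyspark

    if not mapping:
        raise ValueError("mapping must not be empty")

    # Build one chained CASE WHEN expression: raw label -> zero-based index.
    expr = None
    for raw, idx in mapping.items():
        cond = f.col(label_col) == raw
        expr = f.when(cond, idx) if expr is None else expr.when(cond, idx)
    return df.withColumn(label_col, expr.cast("double"))


mapping = zero_based_mapping([1, 2, 3])
print(mapping)  # {1: 0, 2: 1, 3: 2}
```

Applied to the question's data, this would look like `test_df04 = reencode_label_column(test_df03, mapping)`, after which the fitted model should report 3 classes. If you use StringIndexer instead, for example `feat.StringIndexer(inputCol='brand', outputCol='label')` on the original column, note that by default it assigns indices by descending label frequency, so index 0 goes to the most common brand rather than to brand 1.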