apache-spark-mllib, apache-spark-ml

How to specify "positive class" in sparkml classification?


How to specify the "positive class" in sparkml (binary) classification? (Or perhaps: How does a MulticlassClassificationEvaluator determine which class is the "positive" one?)

Suppose we were training a model to target Precision in a binary classification problem like...

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import (IndexToString, OneHotEncoder, StringIndexer,
                                VectorAssembler)
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

label_idxer = StringIndexer(inputCol="response",
                            outputCol="label").fit(df_spark)
# we fit here so we can get the "labels" attribute to inform the reconversion stage

feature_idxer = StringIndexer(inputCols=cat_features,
                              outputCols=[f"{f}_IDX" for f in cat_features],
                              handleInvalid="keep")

onehotencoder = OneHotEncoder(inputCols=feature_idxer.getOutputCols(),
                              outputCols=[f"{f}_OHE" for f in feature_idxer.getOutputCols()])

assembler = VectorAssembler(inputCols=(num_features + onehotencoder.getOutputCols()),
                            outputCol="features")

rf = RandomForestClassifier(labelCol=label_idxer.getOutputCol(),
                            featuresCol=assembler.getOutputCol(),
                            seed=123456789)

label_converter = IndexToString(inputCol=rf.getPredictionCol(),
                                outputCol="prediction_label",
                                labels=label_idxer.labels)

pipeline = Pipeline(stages=[label_idxer, feature_idxer, onehotencoder,
                            assembler,
                            rf,
                            label_converter])  # type: pyspark.ml.Pipeline

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=ParamGridBuilder().build(),  # no grid search; just cross-validate the pipeline as-is
                          evaluator=MulticlassClassificationEvaluator(
                              labelCol=rf.getLabelCol(),
                              predictionCol=rf.getPredictionCol(),
                              metricName="weightedPrecision"),
                          numFolds=3)

(train_u, test_u) = df_spark.randomSplit([0.8, 0.2])
model = crossval.fit(train_u)

I know that...

Precision = TP / (TP + FP) 

...but how do you specify a particular class label as the "positive class" to target for Precision? (As it stands, I don't know which response value is actually being used as the positive class during training, nor how to tell.)
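
For reference, the fitted StringIndexer above does expose the mapping it chose via the same labels attribute that is passed to IndexToString, so the index-to-label assignment can at least be inspected after fitting (the example response values below are hypothetical):

# labels[i] is the original string that the fitted StringIndexer encoded as float(i),
# so labels[1] is whatever value is playing the role of the "1" class.
print(list(enumerate(label_idxer.labels)))
# e.g. [(0, 'no'), (1, 'yes')]  <- hypothetical response values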


Solution

  • From a discussion on the Spark mailing list...

    The positive class is "1" and negative is "0" by convention; I don't think you can change that (though you can translate your data if needed). F1 is defined only in a one-vs-rest sense in multi-class evaluation. You can set 'metricLabel' to define which class is 'positive' in multiclass - everything else is 'negative'.

    Note that this implies that, unless you set metricLabel on a MulticlassClassificationEvaluator, StringIndexer (specifically its stringOrderType param: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.StringIndexer.html?highlight=stringindexer#pyspark.ml.feature.StringIndexer.stringOrderType) is where a user effectively decides which class they are treating as positive and which as negative. Per the docs, the default is frequencyDesc, and under frequencyDesc/frequencyAsc ties in frequency are broken by further sorting the strings alphabetically. So if your positive class happens to be the minority class you get the conventional 0=negative / 1=positive mapping for free; otherwise you need to pick an ordering (or rename your values) so that the 0=negative, 1=positive convention holds (a sketch of pinning the ordering follows at the end of this answer).

    In multi-class, there is no 'positive' class, they're all just classes. It defaults to 0 there but 0 doesn't have any particular meaning. You could apply this to a binary class setup. In that case, you could simply ask for F1 for label 0, and that would compute F1 for '0-vs-rest', and that would be like treating 0 as the 'positive' class for purposes of F1.

    One thing about this interpretation that is concerning is that the BinaryClassificationEvaluator cannot evaluate Fbeta, Recall, Precision, etc. (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.evaluation.BinaryClassificationEvaluator.html?highlight=binaryclassificationevaluator#pyspark.ml.evaluation.BinaryClassificationEvaluator.metricName), whereas the MulticlassClassificationEvaluator can (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.evaluation.MulticlassClassificationEvaluator.html?highlight=classificationevaluator#pyspark.ml.evaluation.MulticlassClassificationEvaluator.metricName). That means users have to switch between the two evaluators if they want to target areaUnderROC in one run and, say, F1 in another, and in the binary case that also means switching which indexed value is treated as the positive class: 1 for the binary evaluator (the conventional positive class noted above) versus 0 for the multiclass evaluator (whose documented default metricLabel is 0), unless metricLabel is set explicitly (a sketch of the two evaluators side by side follows below).
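
    To make the metricLabel point concrete: a minimal sketch, assuming Spark 3.x (where the per-label metrics and the metricLabel param are available), of computing Precision with a chosen class treated as the positive one. Column names follow the pipeline in the question.

    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    # Treat the class indexed as 1.0 as "positive" and compute its one-vs-rest
    # precision; every other label counts as negative.
    precision_eval = MulticlassClassificationEvaluator(labelCol="label",
                                                       predictionCol="prediction",
                                                       metricName="precisionByLabel",
                                                       metricLabel=1.0)
    # precision = precision_eval.evaluate(predictions)  # predictions = model.transform(test_u)

    Swapping this evaluator in for the weightedPrecision one in the CrossValidator above would make cross-validation optimize precision for that specific class rather than the class-weighted average.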
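
    And following the stringOrderType note above: a sketch of pinning the ordering so that you know, rather than infer from class frequencies, which response value ends up as index 1. alphabetAsc is used purely as an illustration, and the "no"/"yes" values are hypothetical.

    # With alphabetAsc the alphabetically-first value gets index 0, so for
    # hypothetical responses {"no", "yes"} this gives no -> 0.0 and yes -> 1.0,
    # matching the 0=negative / 1=positive convention.
    label_idxer = StringIndexer(inputCol="response",
                                outputCol="label",
                                stringOrderType="alphabetAsc").fit(df_spark)
    print(label_idxer.labels)  # confirm the ordering that was actually chosen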
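
    Finally, to illustrate the evaluator-switching caveat in the last point: a sketch of the two evaluators one would have to juggle (again assuming Spark 3.x and the column names produced by the pipeline above).

    from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                       MulticlassClassificationEvaluator)

    # areaUnderROC / areaUnderPR live only here; this evaluator scores the raw
    # prediction column, and "1" is implicitly the positive class.
    roc_eval = BinaryClassificationEvaluator(labelCol="label",
                                             rawPredictionCol="rawPrediction",
                                             metricName="areaUnderROC")

    # F1 / precision / recall per label live only here; which class counts as
    # "positive" is governed by metricLabel, which defaults to 0.0.
    f1_eval = MulticlassClassificationEvaluator(labelCol="label",
                                                predictionCol="prediction",
                                                metricName="fMeasureByLabel",
                                                metricLabel=1.0)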