
Spark Multiclass Classification Using Python


I am trying to implement multiclass classification using PySpark. I have spent a lot of time searching the web, and I have read that this is now possible with Spark 2.1.0.

I have generated my own dataset with all-numerical features and created a DataFrame that holds the features and the class column ('Service_Level').

'Service_Level' takes one of three class values: 0, 1 or 2.

Questions:

  1. Do I have to use LabeledPoint if I have features like these?
  2. How do I use a multilayer perceptron instead of logistic regression?

Thanks.


Solution

  • Since there was no answer, I will share what I found during my research. Using LabeledPoint is fine with the RDD-based Spark MLlib API, which is in maintenance mode in Spark 2.1.0. However, my features were categorical, so with the DataFrame-based Spark ML API I had to convert them to vectors using StringIndexer and OneHotEncoder, chained in a Pipeline, to prepare my features and labels.

    Answering the question
    Yes, LabeledPoint can be used with those features, but only when using Spark MLlib. I was not able to implement the multilayer perceptron because it appeared to require libsvm-formatted data, which I did not have, and I could not convert my CSV into that format.

    In the final implementation, I used the DataFrame-based API, Spark ML.