java apache-spark machine-learning apache-spark-mllib

How to get Spark MLlib RandomForestModel.predict response as text value YES/NO?

I am trying to implement the RandomForest algorithm using Apache Spark MLlib. I have a dataset in CSV format with the following features:

DayOfWeek(int),AlertType(String),Application(String),Router(String),Symptom(String),Action(String)
0,Network1,App1,Router1,Not reachable,YES
0,Network1,App2,Router5,Not reachable,NO

I want to use MLlib's RandomForest to predict the last field, Action, and I want the response as YES/NO.

I am following code from GitHub to create the RandomForest model. Since all of my features are categorical except one int feature, I used the following code to convert them into a JavaRDD<LabeledPoint>. Is any of that wrong?

// Load and parse the data file.
        JavaRDD<String> data = jsc.textFile("/tmp/xyz/data/training-dataset.csv");

        // I have 14 features so giving 14 as arg to the following
        final HashingTF tf = new HashingTF(14);

        // Create LabeledPoint datasets for actionable and non-actionable alerts
        JavaRDD<LabeledPoint> labeledData = data.map(new Function<String, LabeledPoint>() {
            @Override public LabeledPoint call(String alert) {
                List<String> fields = Arrays.asList(alert.trim().split(","));
                // The last field is the label (Action); everything before it is a feature.
                String actionType = fields.get(fields.size() - 1).trim();
                List<String> featureList = fields.subList(0, fields.size() - 1);
                // equalsIgnoreCase: lower-casing and then comparing with "YES" never matched.
                return new LabeledPoint(actionType.equalsIgnoreCase("YES") ? 1.0 : 0.0, tf.transform(featureList));
            }
        });

I create the test data the same way and use it in the following code to make predictions:

JavaPairRDD<Double, Double> predictionAndLabel =
        testData.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
          @Override
          public Tuple2<Double, Double> call(LabeledPoint p) {
            return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
          }
        });

How do I get a prediction for my last field, Action, as YES/NO? The current predict method returns a double, and I don't understand how to turn that into YES/NO. Also, am I following the correct approach for converting categorical features into a LabeledPoint? I am new to machine learning and Spark MLlib.

Solution

I am more familiar with the Scala version, but I'll try to help.

You need to map the target variable (Action) and all categorical features into levels starting at 0, like 0, 1, 2, 3, ... For example, Router1, Router2, ... Router5 become 0, 1, 2, ... 4. The same goes for your target variable, which I think is the only one you actually mapped (YES/NO to 1/0). I am not sure what your tf.transform(featureList) is actually doing.
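As a minimal sketch of that mapping in plain Java (no Spark needed), the helper below assigns each distinct value of one categorical column an index in order of first appearance. The class and method names (`CategoryIndexer`, `buildIndex`) are hypothetical and not part of MLlib.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Spark): maps each distinct category
// value to a level 0, 1, 2, ... in order of first appearance.
class CategoryIndexer {
    static Map<String, Integer> buildIndex(List<String> values) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String value : values) {
            // Only unseen values get a new level; the next free level is the current size.
            index.putIfAbsent(value, index.size());
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> routers = Arrays.asList("Router1", "Router5", "Router1", "Router2");
        // prints {Router1=0, Router5=1, Router2=2}
        System.out.println(buildIndex(routers));
    }
}
```

The index should be built once from the training data and reused unchanged for the test data, so that the same string always maps to the same level. The level is then stored as a double in the feature vector.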

Once you have done this, you can train your RandomForest classifier, specifying the map of categorical features. Basically, you need to tell it which features are categorical and how many levels each one has. This is the Scala version, but you can easily translate it into Java:

val categoricalFeaturesInfo = Map[Int, Int]((2,2),(3,5))

This says that, in your list of features, the 3rd one (index 2) has 2 levels, written (2,2), and the 4th one (index 3) has 5 levels, written (3,5). The rest are treated as continuous (Double) features.
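A rough Java equivalent of that map, as a sketch: as far as I know, the Java overload of RandomForest.trainClassifier takes a `java.util.Map<Integer, Integer>` here. The class `CategoricalInfo` and the `isValid` check are hypothetical additions for illustration; the check only shows what the map implies for encoded values.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of Spark) mirroring the Scala
// Map[Int, Int]((2,2),(3,5)): key = zero-based feature index, value = number of levels.
class CategoricalInfo {
    static Map<Integer, Integer> build() {
        Map<Integer, Integer> info = new HashMap<>();
        info.put(2, 2); // 3rd feature is categorical with 2 levels (0, 1)
        info.put(3, 5); // 4th feature is categorical with 5 levels (0..4)
        return info;
    }

    // An encoded value fits a categorical feature if it is a whole number in [0, levels).
    // Features missing from the map are continuous, so any value is accepted.
    static boolean isValid(Map<Integer, Integer> info, int featureIndex, double value) {
        Integer levels = info.get(featureIndex);
        if (levels == null) {
            return true;
        }
        return value >= 0 && value < levels && value == Math.floor(value);
    }

    public static void main(String[] args) {
        Map<Integer, Integer> info = build();
        System.out.println(isValid(info, 3, 4.0)); // level 4 of 5 is in range
        System.out.println(isValid(info, 3, 5.0)); // level 5 is out of range
    }
}
```

A value outside the declared range usually means the level count in the map is too small or the encoding started at 1 instead of 0.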

Now you pass categoricalFeaturesInfo, together with the other parameters, when training the classifier:

val modelRF = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

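Now, when you need to evaluate it, the predict function returns a double, 0.0 or 1.0, which you can use to compute accuracy, precision, or any other metric you need. To show the result as text, map 1.0 back to YES and 0.0 back to NO, the reverse of the encoding you used for the label.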
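For the YES/NO part of the question, a small sketch of the two directions of the label mapping in plain Java. `ActionLabel`, `encode`, and `decode` are hypothetical names; the sketch assumes YES was encoded as 1.0 and everything else as 0.0, as in the question's code.

```java
// Hypothetical helper (not part of Spark): converts between the text label
// and the numeric class used for training and returned by predict.
class ActionLabel {
    // Text label to numeric class: YES (any case) becomes 1.0, anything else 0.0.
    static double encode(String action) {
        return "YES".equalsIgnoreCase(action.trim()) ? 1.0 : 0.0;
    }

    // Numeric class back to text: 1.0 becomes YES, anything else NO.
    static String decode(double prediction) {
        return prediction == 1.0 ? "YES" : "NO";
    }

    public static void main(String[] args) {
        System.out.println(decode(encode("yes"))); // prints YES
        System.out.println(decode(0.0));           // prints NO
    }
}
```

With a trained model, the call would look something like `ActionLabel.decode(model.predict(p.features()))`.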

Here is an example (sorry, Scala again), assuming you have a testData to which you applied the same transformations as before:

val predictionAndLabels = testData.map { point =>
  val prediction = modelRF.predict(point.features)
  (point.label, prediction)
}

Here the results are clear: the label is 1/0 and the predicted value is also 1/0, so computing accuracy, precision, and recall is straightforward.
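To make those metrics concrete, here is a sketch in plain Java that counts the four outcomes from collected (label, prediction) pairs, with 1.0 as the positive class. `BinaryMetrics` is a hypothetical class written for this example, and returning 0.0 when a denominator is zero is a convention chosen here, not something the source specifies.

```java
// Hypothetical helper (not part of Spark): metrics for a 0/1 classifier
// computed from {label, prediction} pairs, with 1.0 as the positive class.
class BinaryMetrics {
    final int tp, fp, tn, fn;

    BinaryMetrics(double[][] labelAndPrediction) {
        int truePos = 0, falsePos = 0, trueNeg = 0, falseNeg = 0;
        for (double[] pair : labelAndPrediction) {
            boolean actual = pair[0] == 1.0;
            boolean predicted = pair[1] == 1.0;
            if (actual && predicted) truePos++;
            else if (!actual && predicted) falsePos++;
            else if (!actual) trueNeg++;
            else falseNeg++;
        }
        tp = truePos; fp = falsePos; tn = trueNeg; fn = falseNeg;
    }

    // Assumed convention: a zero denominator yields 0.0 instead of NaN.
    private static double ratio(int numerator, int denominator) {
        return denominator == 0 ? 0.0 : (double) numerator / denominator;
    }

    double accuracy()  { return ratio(tp + tn, tp + tn + fp + fn); }
    double precision() { return ratio(tp, tp + fp); }
    double recall()    { return ratio(tp, tp + fn); }

    public static void main(String[] args) {
        double[][] pairs = { {1, 1}, {1, 0}, {0, 0}, {0, 1}, {1, 1} };
        BinaryMetrics m = new BinaryMetrics(pairs);
        System.out.println("accuracy=" + m.accuracy()
                + " precision=" + m.precision() + " recall=" + m.recall());
    }
}
```

The pairs here follow the (label, prediction) order of the Scala example above; on a large test set you would compute the counts on the RDD rather than collecting the pairs first.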

I hope it helps!!