Tags: apache-spark, machine-learning, apache-spark-mllib

ML Tuning - Cross Validation in Spark


I am looking at the cross-validation code example found at https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation

It says:

CrossValidator begins by splitting the dataset into a set of folds which are used as separate training and test datasets. E.g., with k=3 folds, CrossValidator will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing.

So I don't understand why, in the code, the data is separated into training and test sets:

// Run cross-validation, and choose the best set of parameters.
val cvModel = cv.fit(training)

// Prepare test documents, which are unlabeled (id, text) tuples.
val test = spark.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "mapreduce spark"),
  (7L, "apache hadoop")
)).toDF("id", "text")

// Make predictions on test documents. cvModel uses the best model found (lrModel).
cvModel.transform(test)
  .select("id", "text", "probability", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
    println(s"($id, $text) --> prob=$prob, prediction=$prediction")
  }

Would it be possible to apply cross-validation and get predictions without separating the data?


Solution

  • The data is separated into training and test sets so that the data used to tune the hyperparameters is not reused to evaluate the performance of the resulting model. Evaluating the model on data it was trained (or tuned) on would give an overly optimistic performance estimate.

    Maybe it helps to think of test as the "validation" dataset, because training itself is split into 2/3 training data and 1/3 test data in each of the k folds.

    Here's a good explanation of nested cross-validation.

    See also this question for a better explanation of why it may make sense to separate the data into three sets.
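
    Putting that together, the usual pattern looks roughly like the sketch below (following the pipeline from the ml-tuning example; `labeled` is an assumed DataFrame of (id, text, label) rows, and the 80/20 split ratio and seed are arbitrary choices, not prescribed by Spark):

    ```scala
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    // Hold out a test set that the CrossValidator never sees during tuning.
    val Array(training, test) = labeled.randomSplit(Array(0.8, 0.2), seed = 42L)

    // Same pipeline as the ml-tuning example: tokenizer -> hashingTF -> logistic regression.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    val paramGrid = new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, Array(10, 100))
      .addGrid(lr.regParam, Array(0.1, 0.01))
      .build()

    // k = 3: `training` is internally split into 3 folds; each candidate parameter
    // combination is trained on 2/3 of it and validated on the remaining 1/3, 3 times.
    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)

    val cvModel = cv.fit(training)

    // Final performance estimate on data that played no part in the tuning.
    val auc = new BinaryClassificationEvaluator().evaluate(cvModel.transform(test))
    ```

    So the folds answer "which hyperparameters are best?" using only training, while the held-out test set answers "how well does the chosen model actually perform?"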