I am looking at the cross-validation code example found at https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation
It says:
CrossValidator begins by splitting the dataset into a set of folds which are used as separate training and test datasets. E.g., with k=3 folds, CrossValidator will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing.
So I don't understand why, in the code, the data is still separated into training and test sets:
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// `cv` (the CrossValidator) and `training` are defined earlier in the docs example.
// Run cross-validation, and choose the best set of parameters.
val cvModel = cv.fit(training)
// Prepare test documents, which are unlabeled (id, text) tuples.
val test = spark.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "mapreduce spark"),
  (7L, "apache hadoop")
)).toDF("id", "text")
// Make predictions on test documents. cvModel uses the best model found (lrModel).
cvModel.transform(test)
  .select("id", "text", "probability", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
    println(s"($id, $text) --> prob=$prob, prediction=$prediction")
  }
Would it be possible to apply cross-validation and get predictions without separating the data?
The data is separated into training and test to prevent the data that was used to tune the hyperparameters from being used again to evaluate the performance of the resulting model. This avoids evaluating the model on data it was trained on, because the estimate would then be too optimistic.
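For example, here is a minimal sketch of that pattern. The names labeledData, pipeline, and paramGrid are assumptions standing in for the labeled DataFrame, Pipeline, and ParamGridBuilder output built earlier in the linked docs example:

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.CrossValidator

// Hypothetical names: `labeledData`, `pipeline`, and `paramGrid` come from
// earlier in the docs example.
val Array(training, test) = labeledData.randomSplit(Array(0.8, 0.2), seed = 42L)

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// Hyperparameters are tuned on `training` only, via its internal k-fold splits...
val cvModel = cv.fit(training)

// ...so `test` was never seen during tuning and gives an unbiased estimate.
val testScore = new BinaryClassificationEvaluator().evaluate(cvModel.transform(test))
println(s"held-out test AUC: $testScore")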
Maybe it helps to think of test as the "validation" dataset, because training is itself split into 2/3 training data and 1/3 testing data in each of the k folds.
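If you want to see those per-fold splits yourself, here is a rough sketch using MLUtils.kFold, the same kind of splitting CrossValidator performs internally (training is the DataFrame from the sketch above):

import org.apache.spark.mllib.util.MLUtils

// Illustrative only: with k = 3, each fold pairs roughly 2/3 of `training`
// with the remaining 1/3 used for validation.
val folds = MLUtils.kFold(training.rdd, 3, 42)
folds.zipWithIndex.foreach { case ((trainSplit, validationSplit), i) =>
  println(s"fold $i: train rows = ${trainSplit.count()}, validation rows = ${validationSplit.count()}")
}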
Here's a good explanation of nested cross-validation.
See also this question for a better explanation of why it can make sense to separate the data into 3 sets.
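Putting it together, nested cross-validation could be sketched roughly like this, reusing the hypothetical cv and labeledData from the sketch above; the outer folds provide the unbiased estimate while the inner CrossValidator does the tuning:

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.mllib.util.MLUtils

// Outer loop: unbiased evaluation. Inner loop: `cv` re-tunes hyperparameters
// from scratch on each outer training split.
val outerFolds = MLUtils.kFold(labeledData.rdd, 5, 42)
val outerScores = outerFolds.map { case (trainRdd, testRdd) =>
  val trainDf = spark.createDataFrame(trainRdd, labeledData.schema)
  val testDf  = spark.createDataFrame(testRdd, labeledData.schema)
  val model = cv.fit(trainDf)
  new BinaryClassificationEvaluator().evaluate(model.transform(testDf))
}
println(s"nested CV estimate: ${outerScores.sum / outerScores.length}")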