
Use a saved model to transform another dataset in Spark without fitting again


I am using Spark (core and MLlib) version 2.2.0 with Scala.

I successfully saved a CrossValidator model with Logistic Regression. Below is the code I used:

  import org.apache.spark.ml.tuning.CrossValidator
  import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

  val cv = new CrossValidator()
    .setEstimator(lr)
    .setEvaluator(new BinaryClassificationEvaluator)
    .setEstimatorParamMaps(paramGrid)
    .setNumFolds(5)

  val model = cv.fit(trainingData)

  model.write.overwrite().save("./cvmodel")
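
For reference, lr and paramGrid are defined earlier in my code along these lines (a minimal sketch; the grid values here are illustrative, not the ones I actually use):

  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.tuning.ParamGridBuilder

  val lr = new LogisticRegression()           // label/features columns use the defaults
  val paramGrid = new ParamGridBuilder()
    .addGrid(lr.regParam, Array(0.01, 0.1))   // illustrative values only
    .build()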

After that, I try to use it on another dataset with the code below:

  import org.apache.spark.ml.tuning.CrossValidatorModel

  val model = CrossValidatorModel.read.load("./cvmodel")

  val cleanData = DataApi.cleanData(dataset, spark) // custom method

  val preparedData = DataApi.oneHotEncodingData(cleanData).select("label", "features") // custom method

  val predict_dataset = model.transform(preparedData)

  printResult(predict_dataset) // custom method that uses metrics to print statistics of the result

However, when I use a dataset of a different size than the data I tested with (whether larger or smaller), the following error is thrown:

java.lang.IllegalArgumentException: requirement failed: BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching sizes: x.size = 1178, y.size = 9921

The code actually works with a dataset of the same size. I would therefore like to know whether it is possible to use the saved model on another dataset of a different size without fitting it again, and if so, how.
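
From the error, the feature vectors produced for the new dataset do not have the same length as the ones the model was trained on. A quick way to see what length the saved model expects (assuming the best model is the LogisticRegressionModel, as in my training code):

  import org.apache.spark.ml.classification.LogisticRegressionModel

  val lrModel = model.bestModel.asInstanceOf[LogisticRegressionModel]
  println(lrModel.numFeatures)        // length the "features" vectors must have
  println(lrModel.coefficients.size)  // same value, taken from the fitted coefficients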

Thank you for your help.


Solution

  • I actually found the cause of this error. During my one-hot encoding process, I was using some pipelines that, unlike my CrossValidatorModel, I never saved. What I had to do was (see the sketch after this list):

    1. Fit my one-hot encoding pipelines on my training dataset and save them as models
    2. Fit my CrossValidator on my training dataset and save it as a model
    3. Load my one-hot encoding pipeline models and transform my test dataset with them
    4. Load my CrossValidatorModel and transform my test dataset

    When doing this, I no longer had issues.
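
    Roughly, the fix looks like the sketch below. The encoding stages, column names, and the "./encodingmodel" path are simplified placeholders for my actual one-hot encoding pipeline; cv, trainingData, and "./cvmodel" are from the question, and newDataset stands for the dataset to score:

      import org.apache.spark.ml.{Pipeline, PipelineModel}
      import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
      import org.apache.spark.ml.tuning.CrossValidatorModel

      // 1-2. Fit the encoding pipeline and the CrossValidator on the training data, save both
      val indexer   = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
      val encoder   = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
      val assembler = new VectorAssembler().setInputCols(Array("categoryVec")).setOutputCol("features")
      val encodingPipeline = new Pipeline().setStages(Array(indexer, encoder, assembler))

      val encodingModel = encodingPipeline.fit(trainingData)
      encodingModel.write.overwrite().save("./encodingmodel")

      val cvModel = cv.fit(encodingModel.transform(trainingData))
      cvModel.write.overwrite().save("./cvmodel")

      // 3-4. At prediction time, load both models and reuse the *fitted* encoding,
      // so the new dataset is mapped into feature vectors of the same size
      val loadedEncoding = PipelineModel.load("./encodingmodel")
      val loadedCv       = CrossValidatorModel.load("./cvmodel")

      val predictions = loadedCv.transform(
        loadedEncoding.transform(newDataset).select("label", "features"))

    The key point is that the fitted encoding (in particular the StringIndexerModel inside the PipelineModel) carries the category-to-index mapping learned from the training data, so every dataset it transforms ends up with feature vectors of the same length the CrossValidatorModel expects.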