apache-spark, pyspark, apache-spark-ml

How do I standardize the test dataset using StandardScaler in PySpark?


I have train and test datasets as below:

x_train:

inputs
[2,5,10]
[4,6,12]
...

x_test:

inputs
[7,8,14]
[5,5,7]
...

The inputs column is a vector containing the model's features, produced by applying the VectorAssembler class to 3 separate columns.
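
For reference, inputs was assembled roughly like this (the raw column and DataFrame names here are just placeholders):

from pyspark.ml.feature import VectorAssembler

# "f1", "f2", "f3" stand in for the three raw feature columns;
# train_df / test_df are the raw DataFrames before assembling
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="inputs")
x_train = assembler.transform(train_df)
x_test = assembler.transform(test_df)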

When I try to transform the test data using StandardScaler as below, I get an error saying it doesn't have the transform method:

from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="inputs", outputCol="scaled_features")
scaledTrainDF = scaler.fit(x_train).transform(x_train)
scaledTestDF = scaler.transform(x_test)  # this line fails: the scaler itself has no transform method

I am told that I should fit the scaler on the training data only and use those parameters to transform the test set, so it is not correct to do:

scaledTestDF = scaler.fit(x_test).transform(x_test)

So how do I deal with the error mentioned above?


Solution

  • Here is the correct way to use the scaler. StandardScaler is an Estimator, so fit returns a fitted StandardScalerModel; you need to call transform on that fitted model, not on the scaler itself.

    from pyspark.ml.feature import StandardScaler

    scaler = StandardScaler(inputCol="inputs", outputCol="scaled_features")

    # Fit on the training data only; the fitted model stores the column statistics used for scaling
    scaler_model = scaler.fit(x_train)

    # Apply the same fitted model (and thus the same statistics) to both sets
    scaledTrainDF = scaler_model.transform(x_train)
    scaledTestDF = scaler_model.transform(x_test)
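
  • Alternatively, since the features were built with VectorAssembler, the assembler and scaler can be chained in a Pipeline. The sketch below is only an illustration: it assumes the three raw feature columns are named col1, col2, col3 and that raw_train / raw_test are the unassembled DataFrames, so adjust the names to your data. Fitting the pipeline on the training data and reusing the fitted PipelineModel keeps the same fit-once, transform-both pattern.

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler, StandardScaler

    # "col1", "col2", "col3" are placeholder names for the raw feature columns
    assembler = VectorAssembler(inputCols=["col1", "col2", "col3"], outputCol="inputs")
    scaler = StandardScaler(inputCol="inputs", outputCol="scaled_features")

    pipeline = Pipeline(stages=[assembler, scaler])
    pipeline_model = pipeline.fit(raw_train)   # fit on the training data only

    # The fitted PipelineModel applies the same scaling statistics to both sets
    scaledTrainDF = pipeline_model.transform(raw_train)
    scaledTestDF = pipeline_model.transform(raw_test)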