python  machine-learning  scikit-learn  feature-scaling

Do I need to use RobustScaler() and OneHotEncoder() on new data before model.predict()?


Suppose I have this dataframe (in a regression problem) with numerical and categorical data:

                                            df_example

Var1_numerical   Var2_categorical   Var3_numerical   Var4_categorical    Var_to_predict
    20                red            1                    BK                  352352
    10                blue           4                    BL                  345341
     5                orange         6                    BA                  423423
     1                red            3                    BK                  342342
    90                orange         2                    BK                  456456

As part of the process, I will use RobustScaler() on the numeric variables and OneHotEncoder() on the categorical variables so that the model can learn from them. I will then have a trained model that predicts with a certain error.

What I am interested in is predicting on new data using model.predict():

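import numpy as np

pred_list_example = [15, "red", 1, "BK"]   # one new row: Var1, Var2, Var3, Var4
a = np.array(pred_list_example)
a = np.expand_dims(a, 0)                   # shape (1, 4): a single sample

model.predict(a)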

Question 1: Do I need to use RobustScaler() and OneHotEncoder() on pred_list_example before using model.predict(a)?

Question 2: In case the answer to the previous question is "yes", the Var_to_predict will be scaled due to RobustScaler(). Do I need to use RobustScaler().inverse_transform to get the original numeric value of the prediction?


Solution

  • Question 1: Do I need to use RobustScaler() and OneHotEncoder() on pred_list_example before using model.predict(a)?

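    Yes, and more than that: you must reuse the same RobustScaler() and OneHotEncoder() instances that were fitted on the training data, and call transform() on the new data rather than fit_transform(). A freshly created instance won't know how much to scale by or what order your one-hot categories go in.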
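As a rough illustration of reusing the fitted transformers, here is a minimal sketch built on the df_example data from the question. The ColumnTransformer wrapper, the LinearRegression model, and the names feature_cols, preprocessor, and new_row are assumptions made for the example; the question does not say which model or wiring is actually used.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Training data copied from df_example in the question.
df_example = pd.DataFrame({
    "Var1_numerical": [20, 10, 5, 1, 90],
    "Var2_categorical": ["red", "blue", "orange", "red", "orange"],
    "Var3_numerical": [1, 4, 6, 3, 2],
    "Var4_categorical": ["BK", "BL", "BA", "BK", "BK"],
    "Var_to_predict": [352352, 345341, 423423, 342342, 456456],
})
feature_cols = ["Var1_numerical", "Var2_categorical",
                "Var3_numerical", "Var4_categorical"]
X_train = df_example[feature_cols]
y_train = df_example["Var_to_predict"]

# One object holds both fitted transformers, so they travel together.
preprocessor = ColumnTransformer(
    [
        ("num", RobustScaler(), ["Var1_numerical", "Var3_numerical"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"),
         ["Var2_categorical", "Var4_categorical"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# fit_transform is called on the training data only.
X_train_t = preprocessor.fit_transform(X_train)

# LinearRegression is a stand-in; any regressor would be used the same way.
model = LinearRegression().fit(X_train_t, y_train)

# New data: same column names, then transform() with the fitted preprocessor.
new_row = pd.DataFrame([[15, "red", 1, "BK"]], columns=feature_cols)
X_new_t = preprocessor.transform(new_row)
prediction = model.predict(X_new_t)
```

The new row goes through transform() only, so it is scaled with the medians and quartile ranges learned from the training data and gets the same one-hot column order the model was trained on.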

  • Question 2: In case the answer to the previous question is "yes", the Var_to_predict will be scaled due to RobustScaler(). Do I need to use RobustScaler().inverse_transform to get the original numeric value of the prediction?

    Yes, if you scaled Var_to_predict during training. Note a subtlety, though: a fitted RobustScaler() expects the same number of columns it was fitted on, and scales each one by a different amount. This means that if a single scaler was fitted on the X and Y columns together, there's no easy way to give it just your Y variable and ask it to undo the transform on that one variable.

    For this reason, I suggest having two RobustScaler() instances: one for your X variables and one for your Y variable, so that you can undo scaling on a predicted Y variable without having the X variables to go with it.
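The two-scaler idea might look something like the sketch below. It uses only the two numeric columns from the question to stay short, and the LinearRegression model and the names x_scaler and y_scaler are assumptions for illustration, not part of the original post.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import RobustScaler

# Numeric features (Var1, Var3) and target taken from df_example.
X_train = np.array([[20, 1], [10, 4], [5, 6], [1, 3], [90, 2]], dtype=float)
y_train = np.array([352352, 345341, 423423, 342342, 456456], dtype=float)

# Separate scalers: one for the features, one for the target.
x_scaler = RobustScaler()
y_scaler = RobustScaler()

X_train_s = x_scaler.fit_transform(X_train)
# Scalers expect 2-D input, so the target is reshaped to a single column.
y_train_s = y_scaler.fit_transform(y_train.reshape(-1, 1)).ravel()

# Stand-in model trained entirely in the scaled space.
model = LinearRegression().fit(X_train_s, y_train_s)

# Predict on new data: scale X with x_scaler, then undo the scaling of the
# prediction with y_scaler alone, without needing the X columns.
X_new = np.array([[15, 1]], dtype=float)
y_pred_s = model.predict(x_scaler.transform(X_new))
y_pred = y_scaler.inverse_transform(y_pred_s.reshape(-1, 1)).ravel()
```

Because y_scaler was fitted on a single column, inverse_transform accepts the predictions on their own and returns them in the original units of Var_to_predict.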

    There is also the question of whether the Y variable needs to be scaled at all. Some people would say that it's not necessary. You can read a pro and con argument here.