Suppose I have this dataframe (in a regression problem) with numerical and categorical variables:
Var1_numerical  Var2_categorical  Var3_numerical  Var4_categorical  Var_to_predict
20              red               1               BK                352352
10              blue              4               BL                345341
5               orange            6               BA                423423
1               red               3               BK                342342
90              orange            2               BK                456456
So, in one part of the process I will use RobustScaler() on the numeric variables and OneHotEncoder() on the categorical variables so that the model can learn from them. After that, I will have a trained model that predicts with a certain error.
The interesting thing is to predict on new data using model.predict():
import numpy as np
pred_list_example = [15, "red", 1, "BK"]
a = np.array(pred_list_example)
a = np.expand_dims(a, 0)
Question 1: Do I need to use RobustScaler() and OneHotEncoder() on pred_list_example before using model.predict(a)?
Question 2: In case the answer to the previous question is "yes", the Var_to_predict will be scaled due to RobustScaler(). Do I need to use RobustScaler().inverse_transform to get the original numeric value of the prediction?
Question 1: Do I need to use RobustScaler() and OneHotEncoder() on pred_list_example before using model.predict(a)?
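Yes, and more than that: you must use the same RobustScaler() and OneHotEncoder() instances that were fitted on the training data, calling transform() (not fit_transform()) on the new row. A freshly created scaler or encoder won't know how much to scale by or what order your one-hot categories go in.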
Question 2: In case the answer to the previous question is "yes", the Var_to_predict will be scaled due to RobustScaler(). Do I need to use RobustScaler().inverse_transform to get the original numeric value of the prediction?
Yes, if Var_to_predict was scaled before training. Note a subtlety, though: a fitted RobustScaler() expects the same number of columns it was fitted on, and scales each one by a different amount. This means that there's no easy way to give it just your Y variable and ask it to undo the transform on that one column.
For this reason, I suggest having two RobustScaler() instances: one for your X variables and one for your Y variable, so that you can undo scaling on a predicted Y variable without having the X variables to go with it.
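A minimal sketch of the two-scaler idea follows. To keep it short it uses only the two numeric columns from the question, and LinearRegression is an arbitrary stand-in for whatever model you actually train; the names x_scaler and y_scaler are mine.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import RobustScaler

# Numeric features (Var1, Var3) and target from the question's table.
X_num = np.array([[20, 1], [10, 4], [5, 6], [1, 3], [90, 2]], dtype=float)
y = np.array([352352, 345341, 423423, 342342, 456456], dtype=float)

x_scaler = RobustScaler()  # fitted on the X columns only
y_scaler = RobustScaler()  # fitted on the single Y column only

X_scaled = x_scaler.fit_transform(X_num)
# Scalers expect 2-D input, so the target is reshaped to one column and back.
y_scaled = y_scaler.fit_transform(y.reshape(-1, 1)).ravel()

model = LinearRegression().fit(X_scaled, y_scaled)

# Predict for a new row: scale X with x_scaler, undo Y scaling with y_scaler.
X_new = x_scaler.transform(np.array([[15.0, 1.0]]))
pred_scaled = model.predict(X_new)
pred_original = y_scaler.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()
```

Because y_scaler only ever saw one column, inverse_transform works on the prediction alone, with no X values needed.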
There is also the question of whether the Y variable needs to be scaled at all. Some people would say that it's not necessary. You can read a pro and con argument here.