machine-learning · scikit-learn · prediction · scaling

How to scale prediction data when loading a pre-trained model without the training dataset?


Assume I have a training dataset. I split it into train / test sets. For training, I use a StandardScaler to fit_transform the train data and transform the test data. Then I train a model and save it.

train.py:

data = pd.read_csv("train.csv")
X = data["X"]
y = data["y"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

scale = StandardScaler()
X_train_s = scale.fit_transform(X_train)
X_test_s = scale.transform(X_test)

model.fit(X_train_s, y_train)
y_pred = model.predcit(X_test_s)

# save model
joblib.dump(model, filename)

Now I load the model in another script, and I have another dataset used only for prediction. The question is how to scale the prediction dataset when I don't have the train dataset. Is it correct to fit_transform on the prediction dataset, as below?

prediction.py:

data = pd.read_csv("prediction.csv")
X = data["X"]
y = data["y"]

scale = StandardScaler()
X_predict_s = scale.fit_transform(X)

loaded_model = joblib.load(filename)
y_pred = loaded_model(X_predict_s)

Or do I have to load the train data into prediction.py and use it to fit_transform the scaler?


Solution

  • I like using pickle, but the same logic applies to joblib (a joblib equivalent is noted at the end).

    In essence, you have to dump your scaler and load it in the new script, just like you did with model and loaded_model.

    In the script where you trained the model:

    from pickle import dump
    
    # save model
    dump(model, open('model.pkl', 'wb'))
    # save scaler
    dump(scale, open('scale.pkl', 'wb'))
    

    In the script where you load the model:

    from pickle import load
    
    # load model
    loaded_model = load(open('model.pkl', 'rb'))
    # load scaler
    loaded_scale = load(open('scale.pkl', 'rb'))
    

    Now you have to transform your data using loaded_scale and predict on the scaled data using loaded_model.
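
    Putting it together, prediction.py would look something like the minimal sketch below. It reuses the 'model.pkl' and 'scale.pkl' files from above and the "prediction.csv" / "X" names from the question; note that the loaded scaler is only used to transform the prediction data, it is never re-fitted on it.

    import pandas as pd
    from pickle import load

    data = pd.read_csv("prediction.csv")
    X = data[["X"]]

    # load the model and the scaler that were fitted on the training data
    loaded_model = load(open('model.pkl', 'rb'))
    loaded_scale = load(open('scale.pkl', 'rb'))

    # transform only -- do not fit_transform on prediction data
    X_predict_s = loaded_scale.transform(X)
    y_pred = loaded_model.predict(X_predict_s)

    If you prefer to stay with joblib, the equivalent calls are joblib.dump(scale, 'scale.joblib') after training and loaded_scale = joblib.load('scale.joblib') before predicting ('scale.joblib' is just an example file name).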