I built an application that retrains a Keras binary classifier model (0 or 1) at a fixed interval (hourly, daily) on the new data. The data preparation, training and testing work well, or at least as expected. It tests different features and scales them with MinMaxScaler (some values are negative).
On live predictions with one single data point, however, the values are unrealistic (around 0.9987 to 1 most of the time), which is inaccurate. Since the result should indicate how close to "1" the prediction is, getting such high numbers constantly raises alerts.
The code for live prediction is as follows. current_df is a pandas dataframe that contains the single row of live data plus the column headers. From it we select the "features", which are loaded from the db: we implement dynamic feature selection when training the model, which can mean that every model uses a different set of features.
Get the features as a list:
# Convert literal str to list
features = ast.literal_eval(features)
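For illustration (the feature names here are made up), the value stored in the db is a Python list literal as a string:
# e.g. the string as stored in the db (illustrative values)
features = "['price_open', 'price_close', 'volume']"
features = ast.literal_eval(features)  # -> ['price_open', 'price_close', 'volume']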
Then select only the features that I need in the dataframe:
# Select the features
selected_df = current_df[features]
Get the values as a list:
# Get the values of the df
current_list = selected_df.values.tolist()[0]
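As an aside, the same row can also be kept 2-D directly; selected_df.to_numpy() already has shape (1, n_features), which is the shape the scaler and model ultimately expect:
# Equivalent 2-D view of the same row (sketch)
current_array = selected_df.to_numpy()  # shape (1, n_features)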
Then I reshape it and scale it:
# Reshape to allow scaling and predicting
current_list = np.reshape(current_list, (-1, 1))
# Scale the values
current_list = scaler.fit_transform(current_list)
If I call "transform" instead of "fit_transform" in the line above, I get the following error: This MinMaxScaler instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
Then reshape again:
# Reshape back to a single row to be able to predict
current_list = np.reshape(current_list, (1, -1))
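To make the scaling step concrete, here is a minimal sketch (with made-up values) of what reshaping one row to (-1, 1) and calling fit_transform on it does: every feature is treated as a separate sample, so the row gets rescaled against itself rather than against the training data:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

row = np.reshape([3.2, -1.5, 40.0], (-1, 1))  # 3 "samples" of 1 feature each
print(MinMaxScaler().fit_transform(row).reshape(1, -1))
# [[0.11325301 0.         1.        ]]  -- min/max come from this one row only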
Load the model using Keras (model_location is a Path) and predict:
# Loads the model from the local folder
reconstructed_model = keras.models.load_model(model_location)
prediction = reconstructed_model.predict(current_list)
prediction = prediction.flat[0]
Update
The data gets scaled using fit_transform and transform (MinMaxScaler, although it can also be StandardScaler):
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)
And this is run when training the model (the "model" config is not shown):
# Compile the model
model.compile(optimizer=optimizer,
              loss=loss,
              metrics=['binary_accuracy'])
# Train the model
model.fit(X_train, y_train, epochs=epochs, verbose=0)
# Evaluate using Keras built-in function
scores = model.evaluate(X_test, y_test, verbose=0)
testing_accuracy = scores[1]
# create model with sklearn KerasClassifier for evaluation
eval_model = KerasClassifier(model=model, epochs=epochs, batch_size=10, verbose=0)
# Evaluate model using RepeatedStratifiedKFold
accuracy = ML.evaluate_model_KFold(eval_model, X_test, y_test)
# Predict testing data
pred_test = model.predict(X_test, verbose=0)
pred_test = pred_test.flatten()
# extract the predicted class labels
y_predicted_test = np.where(pred_test > 0.5, 1, 0)
Regarding feature selection, the features are not always the same: I use either SelectKBest (10 or 15 features) or RFECV, and I keep the trained model with the highest accuracy, which means the selected features can differ from model to model.
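For context, a minimal sketch of this kind of dynamic selection (the score function and k are illustrative):
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(f_classif, k=10).fit(X_train, y_train)
features = list(X_train.columns[selector.get_support()])
# this list is what gets stored per model and later parsed with ast.literal_eval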
Is there anything I'm doing wrong here? I'm wondering whether the scaling should be done before the feature selection, or whether there's some issue with the scaling itself (a value might be 0 when training and 100 in live use, and the features are not necessarily the same when scaling).
The issue seems to stem from the StandardScaler / MinMaxScaler. The following example shows how to apply the former. If separate scripts handle training and prediction, then the scaler will also need to be serialized at training time and loaded at prediction time.
Set up a classification problem:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
Fit a StandardScaler instance on the training set and use the same parameters to .transform the test set:
import pickle

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

# Train time: Serialize the fitted scaler to a pickle file.
with open("scaler.pkl", "wb") as fh:
    pickle.dump(scaler, fh)

# Test time: Load the scaler and apply it to the test set.
with open("scaler.pkl", "rb") as fh:
    new_scaler = pickle.load(fh)

X_test = new_scaler.transform(X_test)
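The same applies to a single live row at prediction time: keep it as shape (1, n_features) and only call .transform, never fit or fit_transform:
# Sketch: scale one raw live row with the already-fitted scaler
live_row = X[:1]  # stands in for the single live data point (raw values)
live_scaled = new_scaler.transform(live_row)  # shape (1, n_features), no refitting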
This way the model is trained and evaluated on features with similar distributions:
import numpy as np
from sklearn.metrics import accuracy_score
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(X_train.shape[1],)),
    layers.Dense(100),
    layers.Dropout(0.1),
    layers.Dense(1, activation="sigmoid")])

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["binary_accuracy"])
model.fit(X_train, y_train, epochs=25)

y_pred = np.where(model.predict(X_test)[:, 0] > 0.5, 1, 0)
print(accuracy_score(y_test, y_pred))
# 0.8708
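The live-prediction path from the question then reduces to a plain transform + predict (sketch, reusing live_scaled from above):
# Predict one live sample; the row was scaled with the training-time scaler
live_prob = float(model.predict(live_scaled)[0, 0])
live_label = int(live_prob > 0.5)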