Tags: python, machine-learning, scikit-learn, user-input

A problem with the user input during the random forest classifier demonstration


I got over 90% accuracy with the Random Forest classifier, but I am worried that the other algorithms give much lower results [screenshot: Metrics CV]. That is not my main concern, though. The problem is that when I ran the model on user inputs, every prediction was wrong. The columns of the user input are in the same order as the columns of the training data set.

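from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Train on the preprocessed training split and evaluate on the held-out split
model = RandomForestClassifier()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
acc = accuracy_score(y_test, prediction)   # output: 0.91

# Predict on the user-supplied compounds
X_test_user = df_user_compounds_1.to_numpy()
user_input_predictions_1 = model.predict(X_test_user)
user_input_predictions_1    # output: array([0, 0, 0, 0, 0], dtype=int64), but it should be: array([1, 1, 1, 1, 1], dtype=int64)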

Does anyone have any idea why this is happening?

The dataset is preprocessed - no missing values, no duplicates, it was balanced with RandomOverSampler, scaled with MinMaxScaler, no negative values and contains 11 features/7K rows.

...........

Thank you so much @ElvinJafarov. These are parts from df_user_compounds_1, and X_test after your suggestion.

[Screenshot: X_test]

Since I had already used MinMaxScaler(), I had to add two more rows to df_user_compounds_1 containing the corresponding min and max values, so that the user input is scaled the same way as the original data. I found the min and max values with df.describe(include="all"), concatenated these two rows to the user input data frame, and scaled it. [Screenshot: User input preprocessing]
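The workaround above can be compared with reusing the scaler that was fitted on the training data. The sketch below is only an illustration under assumptions: `train_raw` and `user_raw` are made-up two-column frames, not the real 11-feature data. It shows that both routes give the same scaled values as long as the user values lie inside the training min/max range.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical unscaled training features (stand-in for the real data).
train_raw = pd.DataFrame({"f1": [0.0, 5.0, 10.0], "f2": [100.0, 150.0, 200.0]})
# Hypothetical unscaled user input with the same columns in the same order.
user_raw = pd.DataFrame({"f1": [2.5, 7.5], "f2": [120.0, 180.0]})

# Option A: fit the scaler on the training data once, then only transform new rows.
scaler = MinMaxScaler().fit(train_raw)
user_scaled_a = scaler.transform(user_raw)

# Option B (the workaround described above): append the training min and max rows,
# fit a fresh scaler on the combined frame, then drop the two helper rows.
helper = pd.DataFrame([train_raw.min(), train_raw.max()])
combined = pd.concat([user_raw, helper], ignore_index=True)
user_scaled_b = MinMaxScaler().fit_transform(combined)[: len(user_raw)]

print(np.allclose(user_scaled_a, user_scaled_b))
```

If a user value falls outside the training range, Option B would silently rescale everything to the new extremes, whereas Option A keeps the training scale, so keeping the fitted scaler around is probably the safer habit.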

I am happy with the result: the first 5 predictions should be 1, and 4 out of 5 are. [Screenshot: Result]


Solution

  • First of all, it is normal for different algorithms to give different accuracy scores.

    Secondly, in your case, there are several possible reasons for the wrong predictions:

    1. You scaled the inputs in your training data but not in df_user_compounds_1.
    2. Your model might be overfitted.
    3. The training dataset was preprocessed differently from df_user_compounds_1.
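Reasons 1 and 3 can be reproduced on toy data. This is a minimal sketch, assuming synthetic features and hypothetical user rows (none of these values come from the question): a model trained on scaled features is given unscaled rows, and then the same rows are passed through a scikit-learn `Pipeline` that applies the fitted scaler automatically.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Synthetic stand-in data: class 0 has raw values in [0, 40], class 1 in [60, 100].
X_raw = np.vstack([
    rng.uniform(0, 40, size=(100, 3)),
    rng.uniform(60, 100, size=(100, 3)),
])
y = np.array([0] * 100 + [1] * 100)

# Hypothetical raw user rows: the first belongs to class 1, the second to class 0.
user_raw = np.array([[80.0, 90.0, 70.0], [10.0, 20.0, 5.0]])

# Mismatch: the model is trained on scaled features but sees unscaled user rows.
scaler = MinMaxScaler().fit(X_raw)
model = RandomForestClassifier(random_state=0).fit(scaler.transform(X_raw), y)
pred_unscaled = model.predict(user_raw)

# Fix: a Pipeline applies the fitted scaler before every predict call.
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("rf", RandomForestClassifier(random_state=0)),
])
pipe.fit(X_raw, y)
pred_pipeline = pipe.predict(user_raw)

print(pred_unscaled, pred_pipeline)
```

In the mismatched case every raw value is far above the scaled range of 0 to 1, so both rows land in the same class; with the pipeline the two rows are separated. The direction of the error in real data depends on the features, so treat this only as an illustration of the mechanism.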

    Thirdly, this is not how you should choose a model. Try K-fold cross-validation and hyperparameter tuning.
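As a rough sketch of that advice, the example below runs 5-fold cross-validation and a small grid search with scikit-learn. The data comes from `make_classification` and the parameter grid is an arbitrary choice for illustration; neither is taken from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for the real 11-feature dataset.
X, y = make_classification(n_samples=300, n_features=11, random_state=0)

# Stratified 5-fold split keeps the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# K-fold cross-validation: one accuracy score per fold instead of a single split.
scores = cross_val_score(
    RandomForestClassifier(random_state=0), X, y, cv=cv, scoring="accuracy"
)
print(scores.mean(), scores.std())

# Hyperparameter tuning: try each combination in the grid with the same folds.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=cv, scoring="accuracy"
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

A large spread between fold scores, or a cross-validated score well below the single-split accuracy, would be one hint that the model is overfitted.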