Tags: python, machine-learning, scikit-learn, random-forest, imbalanced-data

Random Forest gets 98% accuracy in training and testing but always predicts the same class on new data


I have spent 30 hours debugging this single problem and it makes absolutely no sense; hopefully one of you can show me a different perspective.

The problem is that I train a random forest on my training dataframe and get very good accuracy (98%-99%), but when I load in a new sample to predict on, the model ALWAYS guesses the same class.

#  Imports needed for this snippet
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

#  Shuffle the dataframe's records; the labels stay attached
df = df.sample(frac=1).reset_index(drop=True)

#  Extract the labels, then remove them from the features
y = list(df['label'])
X = df.drop(['label'], axis='columns')

#  Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE)

#  Construct the model
model = RandomForestClassifier(n_estimators=N_ESTIMATORS, max_depth=MAX_DEPTH,
                               random_state=RANDOM_STATE, oob_score=True)

#  Calculate the training accuracy
in_sample_accuracy = model.fit(X_train, y_train).score(X_train, y_train)
#  Calculate the testing accuracy
test_accuracy = model.score(X_test, y_test)

print()
print('In Sample Accuracy: {:.2f}%'.format(in_sample_accuracy * 100))
print('OOB Score: {:.2f}%'.format(model.oob_score_ * 100))
print('Test Accuracy: {:.2f}%'.format(test_accuracy * 100))
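
With a class split of roughly 150,000 vs 600,000, overall accuracy can mask poor minority-class performance. A quick sanity check (a sketch using scikit-learn's standard metrics; nothing here is specific to this pipeline) is to inspect per-class results and the distribution of predictions on the held-out set:

    from collections import Counter

    from sklearn.metrics import classification_report, confusion_matrix

    #  Per-class precision/recall on the held-out set
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))

    #  A degenerate model puts (almost) all of its
    #  predictions in a single bucket here
    print(Counter(y_pred))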

I process the new data the same way, and when I predict on X_test or X_train I get my usual ~98%, but when I predict on the new data it always guesses the same class.

    import pandas as pd

    #  The json file is not in the correct format; this function normalizes it
    normalized_json = json_normalizer(json_file, "", training=False)
    #  Turn the json into a list of dictionaries that contain the features
    features_dict = create_dict(normalized_json, label=None)

    #  Convert the dictionaries into a pandas dataframe
    df = pd.DataFrame.from_records(features_dict)
    print('Total number of email samples:', len(df))
    print()

    #  Fill missing values with the same sentinel used in training
    df = df.fillna(-1)
    #  One-hot encode string values
    df = one_hot_encode(df, noOverride=True)
    if 'label' in df.columns:
        df = df.drop(['label'], axis='columns')
    print(list(model.predict(df))[:100])
    print(list(model.predict(X_train))[:100])

Above is my testing scenario. In the last two lines I predict on X_train (the data used to train the model) and on df (the out-of-sample data), and on df it always guesses class 0.
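
One thing worth ruling out in a pipeline like this (an assumption on my part, not something confirmed above): one-hot encoding the new data on its own can produce a different set or order of columns than the training data saw, which silently feeds the model mismatched features. A minimal sketch of aligning the new frame to the training columns, assuming X_train is the dataframe the model was fit on:

    #  Hypothetical alignment step: force the new dataframe's columns to
    #  match the training columns, in the same order. Columns missing from
    #  the new data are filled with -1 (the sentinel used above for NaNs);
    #  columns the training data never saw are dropped.
    df = df.reindex(columns=X_train.columns, fill_value=-1)
    print(list(model.predict(df))[:100])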

Some useful information:

  • The datasets are imbalanced: class 0 has about 150,000 samples, while class 1 has about 600,000 (see the class-weighting sketch after this list)
  • There are 141 features
  • Changing n_estimators and max_depth doesn't fix it
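
For that roughly 1:4 imbalance, one standard knob to try (a sketch, not a confirmed fix for this case) is scikit-learn's built-in class weighting, which reweights classes inversely to their frequency:

    from sklearn.ensemble import RandomForestClassifier

    #  class_weight='balanced' weights each class inversely to its
    #  frequency, so the minority class carries more weight per sample
    model = RandomForestClassifier(n_estimators=N_ESTIMATORS,
                                   max_depth=MAX_DEPTH,
                                   random_state=RANDOM_STATE,
                                   class_weight='balanced',
                                   oob_score=True)
    model.fit(X_train, y_train)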

Any ideas would be helpful. If you need more information, let me know; my brain is fried right now and that's all I could think of.


Solution

  • Fixed. The issue was the imbalance of the datasets, and I also realized that changing the depth gave me different results.

    For example: 10 trees with depth 3 seemed to work fine, while 10 trees with depth 6 went back to guessing only the same class.
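
To make that depth comparison reproducible, here is a sketch on synthetic data (the 20/80 split mirrors the ~150k/600k imbalance from the question; everything else is assumed) that sweeps max_depth and prints the distribution of predicted classes:

    from collections import Counter

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    #  Synthetic stand-in: ~20% class 0, ~80% class 1, as in the question
    X, y = make_classification(n_samples=50_000, n_features=141,
                               weights=[0.2, 0.8], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for depth in (3, 6):
        clf = RandomForestClassifier(n_estimators=10, max_depth=depth,
                                     random_state=0)
        clf.fit(X_train, y_train)
        #  If the forest has collapsed onto one class, Counter shows
        #  (nearly) all predictions in a single bucket
        print(depth, Counter(clf.predict(X_test)))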