Tags: python, machine-learning, scikit-learn, regression, sklearn-pandas

Random Forest Regression Accuracy different for Training set and Test set


I am new to Machine Learning and to Python. I am trying to build a Random Forest Regression model on one of the datasets from the UCI repository. This is my first ML model. I may be entirely wrong in my approach.

The dataset is available here - https://archive.ics.uci.edu/ml/datasets/abalone

Below is the entire working code that I have written. I am using Python 3.6.4 with Windows 7 x64 OS (forgive me for the lengthy code).

import tkinter as tk # Required for enabling GUI options
from tkinter import messagebox # Required for pop-up window
from tkinter import filedialog # Required for getting full path of file
import pandas as pd # Required for data handling
from sklearn.model_selection import train_test_split # Required for splitting data into training and test set
from sklearn.ensemble import RandomForestRegressor # Required to build random forest

#------------------------------------------------------------------------------------------------------------------------#
# Create an instance of tkinter and hide the window

root = tk.Tk() # Create an instance of tkinter
root.withdraw() # Hides root window
#root.lift() # Required for pop-up window management
root.attributes("-topmost", True) # To make pop-up window stay on top of all other windows

#------------------------------------------------------------------------------------------------------------------------#
# This block of code reads input file using tkinter GUI options

print("Reading input file...")

# Pop up window to ask user the input file
File_Checker = messagebox.askokcancel("Random Forest Regression Prompt",
                                      "At The Prompt, Enter 'Abalone_Data.csv' File.")

# Kill the execution if user selects "Cancel" in the above pop-up window
if (File_Checker == False):
    quit()
else:
    del(File_Checker)

file_loop = 0

while (file_loop == 0):
    # Get path of base file
    file_path =  filedialog.askopenfilename(initialdir = "/",
                                            title = "File Selection Prompt",
                                            filetypes = (("CSV Files","*.csv"), ("All Files","*.*")))

    # Condition to check if user selected a file or not
    if (len(file_path) < 1):
        # Pop-up window to warn user that no file was selected
        result = messagebox.askretrycancel("File Selection Prompt Error",
                                           "No file has been selected. \nWhat do you want to do?")

        # Condition to repeat the loop or quit program execution
        if (result == True):
            continue
        else:
            quit()

    # Get file name
    file_name = file_path.split("/") # Splits the file with "/" as the delimiter and returns a list
    file_name = file_name[-1] # extracts the last element of the list

    # Condition to check if correct file was selected or not
    if (file_name != "Abalone_Data.csv"):
        result = messagebox.askretrycancel("File Selection Prompt Error",
                                           "Incorrect file selected. \nWhat do you want to do?")

        # Condition to repeat the loop or quit program execution
        if (result == True):
            continue
        else:
            quit()

    # Read the base file
    input_file = pd.read_csv(file_path,
                             sep = ',',
                             encoding = 'utf-8',
                             low_memory = True)

    break

# Delete variables that are no longer needed
del(file_loop, file_name)

#------------------------------------------------------------------------------------------------------------------------#
print("Preparing dependent and independent variables...")

# Create Separate dataframe consisting of only dependent variable
y = pd.DataFrame(input_file['Rings'])

# Create Separate dataframe consisting of only independent variable
X = input_file.drop(columns = ['Rings'], inplace = False, axis = 1)

#------------------------------------------------------------------------------------------------------------------------#
print("Handling Dummy Variable Trap...")

# Create a new dataframe to handle categorical data
# This method splits the categorical data column into separate columns
# drop_first = True drops one level so that we avoid the dummy variable trap
dummy_Sex = pd.get_dummies(X['Sex'], prefix = 'Sex', prefix_sep = '_', drop_first = True)

# Remove the specific columns from the dataframe
# These are the categorical data columns which were split into separate columns in the previous step
X.drop(columns = ['Sex'], inplace = True, axis = 1)

# Merge the new columns to the original dataframe (same name as created above: dummy_Sex)
X = pd.concat([X, dummy_Sex], axis = 1)

#------------------------------------------------------------------------------------------------------------------------#
y = y.values 
X = X.values

#------------------------------------------------------------------------------------------------------------------------#
print("Splitting datasets to training and test sets...")

# Splitting the data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

#------------------------------------------------------------------------------------------------------------------------#
print("Fitting Random Forest regression on training set")

# Fitting the regression model to the dataset
regressor = RandomForestRegressor(n_estimators = 100, random_state = 50)
regressor.fit(X_train, y_train.ravel()) # Using ravel() to avoid getting 'DataConversionWarning' warning message

#------------------------------------------------------------------------------------------------------------------------#
print("Predicting Values")

# Predicting a new result with regression
y_pred = regressor.predict(X_test)

# Enter values for new prediction as a Dictionary
test_values = {'Sex_I' : 0,
               'Sex_M' : 0,
               'Length' : 0.5,
               'Diameter' : 0.35,
               'Height' : 0.8,
               'Whole_Weight' : 0.223,
               'Shucked_Weight' : 0.09,
               'Viscera_Weight' : 0.05,
               'Shell_Weight' : 0.07}

# Convert dictionary into dataframe
test_values = pd.DataFrame(test_values, index = [0])

# Rearranging columns as required
test_values = test_values[['Length','Diameter','Height','Whole_Weight','Shucked_Weight','Viscera_Weight',
                           'Shell_Weight', 'Sex_I', 'Sex_M']]

# Applying feature scaling
#test_values = sc_X.transform(test_values)

# Predicting values of new data
new_pred = regressor.predict(test_values)

#------------------------------------------------------------------------------------------------------------------------#
"""
print("Building Confusion Matrix...")

# Making the confusion matrix
cm = confusion_matrix(y_test, y_pred)
"""
#------------------------------------------------------------------------------------------------------------------------#
print("\n")
print("Getting Model Accuracy...")

# Get regression details
#print("Estimated Coefficient = ", regressor.coef_)
#print("Estimated Intercept = ", regressor.intercept_)
print("Training Accuracy = ", regressor.score(X_train, y_train))
print("Test Accuracy = ", regressor.score(X_test, y_test))

print("\n")
print("Printing predicted result...")
print("Result_of_Treatment = ", new_pred)

When I look at the model accuracy, below is what I get.

Getting Model Accuracy...
Training Accuracy =  0.9359702279804791
Test Accuracy =  0.5695080680053354

Below are my questions.

1) Why are the Training Accuracy and Test Accuracy so far away?

2) How do I know if this model is being over/under fitted?

3) Is Random forest Regression the right model to use? If no, how do I determine the right model for this use-case?

4) How can I build a confusion matrix using the variables I have created?

5) How do I validate the performance of the model?

I am looking for your guidance so that I too can learn from my mistakes and improve on my modelling skills.


Solution

  • Before trying to answer your points, a comment: I see you are using a regressor and reporting its score as "accuracy". Accuracy is a metric used in classification problems; for a scikit-learn regressor, score() returns the coefficient of determination (R²), and regression models are usually evaluated with other metrics, such as the Mean Squared Error (MSE). See here.

    If you switch to a more suitable metric, you may find that your model is not so bad.
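
As a rough illustration of the metrics mentioned above, here is a minimal sketch that reports MSE, RMSE, MAE and R² for a random forest. The data is a synthetic stand-in generated with make_regression (an assumption, since the abalone file is not loaded here); with your own script you would reuse your X_train, X_test, y_train and y_test instead.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): swap in the abalone X and y.
X, y = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

regressor = RandomForestRegressor(n_estimators=100, random_state=50)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

mse = mean_squared_error(y_test, y_pred)   # average squared error
rmse = float(np.sqrt(mse))                 # same units as the target
mae = mean_absolute_error(y_test, y_pred)  # average absolute error
r2 = r2_score(y_test, y_pred)              # same value as regressor.score(X_test, y_test)

print(f"MSE  = {mse:.3f}")
print(f"RMSE = {rmse:.3f}")
print(f"MAE  = {mae:.3f}")
print(f"R^2  = {r2:.3f}")
```

RMSE and MAE are expressed in the units of the target (rings, in the abalone case), which usually makes them easier to interpret than a bare score.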

    I will reply to your questions anyway.

    Why are the Training Accuracy and Test Accuracy so far away? This means that you overfitted your training samples: your model is very good at predicting the data in the training set, but it generalises much less well. It is like a model trained on a set of cat pictures that believes only those pictures are cats, and that pictures of any other cat are not. Keep in mind that the ~0.57 you get on the test set is an R² value, not a classification accuracy: a 0.5 accuracy on a balanced binary problem would be basically a random guess, whereas an R² of 0.57 means the model explains a bit more than half of the variance of the test targets.

    How do I know if this model is being over/under fitted? Exactly from the difference in score between the two sets. The closer they are to each other, the better the model is able to generalise. You already know what an overfit looks like. An underfit is generally recognisable by a low score on both sets.
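
To make the train/test comparison concrete, here is a small sketch that fits the same forest with different values of min_samples_leaf and prints both scores. The data is synthetic (an assumption standing in for the abalone set), and the values 1, 10 and 50 are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Noisy synthetic stand-in data (an assumption): swap in the abalone X and y.
X, y = make_regression(n_samples=1000, n_features=8, noise=50.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)


def train_test_r2(min_samples_leaf):
    """Fit a forest and return (train R^2, test R^2)."""
    model = RandomForestRegressor(
        n_estimators=100, min_samples_leaf=min_samples_leaf, random_state=50
    )
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)


# Larger leaves give simpler trees, which fit the training set less closely.
scores = {leaf: train_test_r2(leaf) for leaf in (1, 10, 50)}

for leaf, (train_r2, test_r2) in scores.items():
    gap = train_r2 - test_r2
    print(f"min_samples_leaf={leaf:>2}: train={train_r2:.3f} "
          f"test={test_r2:.3f} gap={gap:.3f}")
```

A large gap with a high training score points towards overfitting; low scores on both sets point towards underfitting.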

    Is Random forest Regression the right model to use? If no, how do i determine the right model for this use-case? There is no single right model to use. Random Forest, and tree-based models in general (LightGBM, XGBoost), are the Swiss army knife of machine learning when you are dealing with structured data, because of their simplicity and reliability. Models based on Deep Learning can perform better on some problems, but they are much more complex to set up.
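
One common way to compare candidate models is to cross-validate each of them with the same metric. The sketch below does that for a linear baseline and a random forest; the choice of candidates, the 5 folds and the synthetic data are all assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption): swap in the abalone X and y.
X, y = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=0)

# Hypothetical shortlist of models to compare.
candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=50),
}

cv_mse = {}
for name, model in candidates.items():
    # scikit-learn reports negated MSE so that "higher is better" always holds.
    neg_mse = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_squared_error"
    )
    cv_mse[name] = -neg_mse.mean()
    print(f"{name}: mean CV MSE = {cv_mse[name]:.2f}")
```

Because make_regression produces a linear target, the linear baseline will probably look best on this stand-in data; that says nothing about which model suits the abalone data, which is exactly why the comparison has to be run on the real dataset.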

    How can I build a confusion matrix using the variables I have created? Confusion matrices are built from the output of a classification model. You are using a regressor, so a confusion matrix does not make much sense here.

    How do I validate the performance of the model? In general, for a reliable validation of performance you split the data in three: you train on one part (the training set), tune the model on a second (the validation set, which is what you call the test set), and finally, when you are happy with the model and its hyper-parameters, you test it on the third (the test set, not to be confused with the one you currently call the test set). This last one tells you whether your model generalises well or not. This is because when you choose and tune the model you can also overfit the validation set (the one you call the test set), for example by selecting a set of hyper-parameters that performs well only on that set. You also have to choose a reliable metric, and this depends both on the data and on the model. For regression, the MSE is a pretty good choice.
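
The three-way split described above could look roughly like this. It is a sketch under assumptions: the data is synthetic, the 60/20/20 proportions are just one common choice, and min_samples_leaf is used as the only hyper-parameter being tuned.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): swap in the abalone X and y.
X, y = make_regression(n_samples=1000, n_features=8, noise=50.0, random_state=0)

# First hold out 20% as the final test set, then split the rest 75/25,
# which gives 60% training, 20% validation, 20% test overall.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0
)

# Tune on the validation set only; the test set is not touched here.
best_leaf, best_val_mse = None, float("inf")
for leaf in (1, 5, 10, 20):
    model = RandomForestRegressor(
        n_estimators=100, min_samples_leaf=leaf, random_state=50
    )
    model.fit(X_train, y_train)
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    print(f"min_samples_leaf={leaf:>2}: validation MSE = {val_mse:.2f}")
    if val_mse < best_val_mse:
        best_leaf, best_val_mse = leaf, val_mse

# Refit with the chosen setting on train + validation, then test once.
final_model = RandomForestRegressor(
    n_estimators=100, min_samples_leaf=best_leaf, random_state=50
)
final_model.fit(X_trainval, y_trainval)
test_mse = mean_squared_error(y_test, final_model.predict(X_test))
print(f"chosen min_samples_leaf={best_leaf}, test MSE = {test_mse:.2f}")
```

The test MSE is computed only once, after all tuning decisions are made, so it stays an honest estimate of how the model behaves on unseen data.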