python machine-learning dataset

Finding the accuracy of the model on the train and test sets


Here is the dataset I'm working with. It is a survey of VR application areas from 2019 to 2021, where N represents the "number of applications in each area" and % represents the "percentage of the total sample". I'm having problems finding the accuracy of the model on the train and test sets.

First, I imported the libraries and read the CSV file:

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn.metrics import r2_score

df = pd.read_csv('VR_application.csv')

Now I split the data into train and test sets:

X_train, X_test, y_train, y_test = train_test_split(df[['year']], df[['N', '%']], test_size=0.2, random_state=42)

Creating and training the linear regression model:

model = LinearRegression()

model.fit(X_train, y_train)

Predicting the values for the training and testing sets:

y_train_pred = model.predict(X_train)

y_test_pred = model.predict(X_test)

Calculating the accuracy (R-squared score) for the train and test sets:

train_accuracy = r2_score(y_train, y_train_pred)

test_accuracy = r2_score(y_test, y_test_pred)

Checking the accuracy on the train and test sets:

print(f"Train Accuracy: {train_accuracy}")

print(f"Test Accuracy: {test_accuracy}")

After doing this I get the following accuracy:

Train Accuracy: 0.004421041085529986

Test Accuracy: -0.09666987166765573

Can you check my code and identify the problem I'm missing?


Solution

    • Firstly, r2_score is a metric for how well a regression model fits the observed data; it is not an accuracy. A score of 1 means the model perfectly explains the variability in the dependent variable, and a score of 0 means it explains none of that variability (it does no better than always predicting the mean). The score is not bounded below by 0: it becomes negative when the model fits worse than the mean would, which is what your test score of about -0.097 shows. Accuracy, by contrast, is a classification metric, so accuracy and r2_score are completely different things.
    • Secondly, I can see that you have used train_test_split in the wrong way. Use this (note the double brackets, which are needed to select several columns): X_train, X_test, y_train, y_test = train_test_split(df[['N', '%']], df['year'], test_size=0.2, random_state=42)
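
To make the first point concrete, here is a minimal sketch of the standard R-squared definition, 1 - SS_res / SS_tot, written in plain Python. The helper name r2 and the numbers are made up for illustration and do not come from the VR dataset; scikit-learn's r2_score follows the same definition for a single target.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (hypothetical helper)."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: how far the predictions are from the targets.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: how far the targets are from their own mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up target values, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2(y_true, y_true)               # predictions equal the targets
baseline = r2(y_true, [2.5] * 4)           # always predict the mean of y_true
worse = r2(y_true, [4.0, 3.0, 2.0, 1.0])   # predictions in reversed order

print(perfect, baseline, worse)  # prints: 1.0 0.0 -3.0
```

A score near 0 on the train set and a slightly negative score on the test set, as in the question, therefore suggests that the fitted line is about as informative as predicting the mean.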