python, pandas, machine-learning, scikit-learn, decision-tree

Getting 100% Accuracy on my DecisionTree Model


Here is my code. It always returns 100% accuracy, regardless of how big the test size is. I used train_test_split, so I do not believe the training and test sets share any data. Could someone inspect my code?

from sklearn.tree import DecisionTreeClassifier
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


data = pd.read_csv('housing.csv')

prices = data['median_house_value']
features = data.drop(['median_house_value', 'ocean_proximity'], axis = 1)

prices.shape
(20640,)

features.shape
(20640, 8)


X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)

X_train = X_train.dropna()
y_train = y_train.dropna()
X_test = X_test.dropna()
y_test = X_test.dropna()

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

y_train.shape
(16512,)

X_train.shape
(16512, 8)


predictions = model.predict(X_test)
score = model.score(y_test, predictions)
score 

Solution

  • EDIT: I have reworked my answer since I found multiple issues. Please use the code below, which addresses all of them.

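    Issues -

    1. You are using DecisionTreeClassifier instead of DecisionTreeRegressor for a regression problem (median_house_value is a continuous target).
    2. You are removing NaNs after the train/test split, and separately for the features and the targets, so the two can end up with different numbers of rows. Do the data.dropna() before the split.
    3. You are using model.score incorrectly. It expects model.score(X_test, y_test), but you pass it (y_test, predictions). On top of that, the line y_test = X_test.dropna() overwrites the test targets with the test features, so the model predicts on X_test a second time and is compared against its own predictions. That is why you always get 100%. Note also that accuracy is a classification metric: for a regressor, model.score(X_test, y_test) returns R^2, and sklearn.metrics provides regression metrics such as mean_absolute_error.
    from sklearn.tree import DecisionTreeRegressor  # <---- FIRST ISSUE
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_absolute_error
    
    
    data = pd.read_csv('housing.csv')
    
    data = data.dropna()  # <---- SECOND ISSUE: drop NaNs before splitting
    
    prices = data['median_house_value']
    features = data.drop(['median_house_value', 'ocean_proximity'], axis=1)
    
    X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)
    
    model = DecisionTreeRegressor()
    model.fit(X_train, y_train)
    
    predictions = model.predict(X_test)
    score = model.score(X_test, y_test)  # <---- THIRD ISSUE: R^2 on the held-out data
    mae = mean_absolute_error(y_test, predictions)  # average error, in the target's units
    score, mae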
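
To see why the original code reported 100% no matter the test size, here is a minimal sketch on synthetic data. The random arrays below are a made-up stand-in, not housing.csv, and the labels are pure noise so that no model could genuinely predict them. It compares scoring a tree against its own predictions with scoring it against the held-out labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data: 4 random features and 5 random classes.
# The labels carry no signal, so an honest test score should be poor.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = rng.integers(0, 5, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
predictions = model.predict(X_test)

# model.score(X, y) calls model.predict(X) internally and compares with y.
# Passing the model's own predictions as y compares them with themselves.
self_score = model.score(X_test, predictions)

# Passing the held-out labels gives the number you actually care about.
true_score = model.score(X_test, y_test)

print("scored against own predictions:", self_score)
print("scored against held-out labels:", true_score)
```

The first number is a perfect score by construction, whatever the model or the split, while the second reflects how well the tree generalises. The same reasoning applies to a regressor, where score returns R^2 instead of accuracy.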