Here is my code, and it always returns 100% accuracy, regardless of how big the test size is. I used the train_test_split method, so I do not believe there should be any duplicates of data. Could someone inspect my code?
from sklearn.tree import DecisionTreeClassifier
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data = pd.read_csv('housing.csv')
prices = data['median_house_value']
features = data.drop(['median_house_value', 'ocean_proximity'], axis = 1)
prices.shape
(20640,)
features.shape
(20640, 8)
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)
X_train = X_train.dropna()
y_train = y_train.dropna()
X_test = X_test.dropna()
y_test = X_test.dropna()
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_train.shape
(16512,)
X_train.shape
(16512, 8)
predictions = model.predict(X_test)
score = model.score(y_test, predictions)
score
EDIT: I have reworked my answer since I found multiple issues. Please copy-paste the below code to ensure no bugs are left.
Issues -
DecisionTreeClassifier
instead of DecisionTreeRegressor
for a regression problem.nans
after doing the test train split which will mess up the count of samples. Do the data.dropna()
before the split.model.score(X_test, y_test)
incorrectly by passing it (X_test, predictions)
. Please use accuracy_score(X_test, predictions)
with those parameters instead, or fix the syntax.from sklearn.tree import DecisionTreeRegressor #<---- FIRST ISSUE
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data = pd.read_csv('housing.csv')
data = data.dropna() #<--- SECOND ISSUE
prices = data['median_house_value']
features = data.drop(['median_house_value', 'ocean_proximity'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)
model = DecisionTreeRegressor()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
score = accuracy_score(y_test, predictions) #<----- THIRD ISSUE
score