I have created a decision tree model in Python using sklearn. It uses a large public data set that relates human factors (age, BMI, sex, smoking, etc.) to the cost of medical care that insurance companies pay each year. I split the data set with a test size of 0.2, but the mean absolute error and mean squared error are incredibly high. I tried different splits (0.5, 0.8) but did not get any different results. The model's predictions appear to be quite off in some areas, but I am not sure which part is lacking or what I need to improve. I have attached photos of my output (through an Imgur link, as I cannot add photos) as well as my code, and I appreciate any guidance in the right direction!
https://i.sstatic.net/KrOMv.jpg
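import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeRegressor

dataset = pd.read_csv('insurance.csv')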
LE = LabelEncoder()
LE.fit(dataset.sex.drop_duplicates())
dataset.sex = LE.transform(dataset.sex)
LE.fit(dataset.smoker.drop_duplicates())
dataset.smoker = LE.transform(dataset.smoker)
LE.fit(dataset.region.drop_duplicates())
dataset.region = LE.transform(dataset.region)
print("Data Head")
print(dataset.head())
print()
print("Data Info")
print(dataset.info())
print()
for i in dataset.columns:
    print('Null Values in {i} :'.format(i=i), dataset[i].isnull().sum())
X = dataset.drop('charges', axis = 1)
y = dataset['charges']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0)
regressor = DecisionTreeRegressor()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
df = pd.DataFrame({'Actual Value': y_test, 'Predicted Values': y_pred})
print(df)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Some things you can try, if you are not doing them already:
- Apply StandardScaler() from scikit-learn to the non-categorical columns/features. (Tree splits are not affected by feature scale, so this matters mainly if you also try other kinds of models.)
- Use GridSearchCV from scikit-learn to search for appropriate hyper-parameters instead of tuning them manually, although tuning them manually may give you some sense of which parameter values might work.
- Read the documentation for DecisionTreeRegressor carefully to make sure that your implementation is in agreement with it.
See if this helps.
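As a rough sketch of the GridSearchCV suggestion, something like the following could be a starting point. It does not read insurance.csv: the data frame below is a synthetic stand-in with made-up columns and a made-up cost formula, and the parameter grid, the `cv=5` setting, and the choice of mean absolute error as the scoring metric are all assumptions to adjust for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for insurance.csv (assumption: not the real data set).
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n),
    "smoker": rng.integers(0, 2, size=n),  # treated as already label-encoded
})
# Made-up target: cost grows with age and BMI, jumps for smokers, plus noise.
y = 250 * X["age"] + 300 * X["bmi"] + 20000 * X["smoker"] + rng.normal(0, 3000, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scale only the non-categorical columns; pass the encoded column through.
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), ["age", "bmi"])],
    remainder="passthrough",
)
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeRegressor(random_state=0)),
])

# Hypothetical grid: limit tree depth and leaf size to curb overfitting.
param_grid = {
    "tree__max_depth": [2, 3, 4, 6, 8, None],
    "tree__min_samples_leaf": [1, 5, 10, 20],
}
search = GridSearchCV(
    pipeline, param_grid, scoring="neg_mean_absolute_error", cv=5
)
search.fit(X_train, y_train)

# Untuned tree with default settings, for comparison on the same test split.
untuned = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

tuned_mae = mean_absolute_error(y_test, search.predict(X_test))
untuned_mae = mean_absolute_error(y_test, untuned.predict(X_test))
print("Best parameters:", search.best_params_)
print("Tuned tree MAE:", tuned_mae)
print("Untuned tree MAE:", untuned_mae)
```

A default DecisionTreeRegressor grows until its leaves are pure, so it tends to memorize the training set; restricting `max_depth` and `min_samples_leaf` is one common way to check whether overfitting accounts for part of the error. It also helps to compare the MAE with the typical size of `charges`, since an error that looks large in absolute terms may be modest relative to the target.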