python pandas machine-learning scikit-learn decision-tree

How can I improve the accuracy of my prediction from a decision tree model using sklearn?

I have created a decision tree model in Python using sklearn. It uses a large public data set that relates human factors (age, bmi, sex, smoking, etc.) to the cost of medical care that insurance companies pay each year. I split the data set with a test size of 0.2, but the mean absolute error and mean squared error are incredibly high. I tried different splits (0.5, 0.8) but the results did not change. The predictions appear to be quite far off in some areas, and I am not sure which part is lacking or what I need to improve. I have attached photos of my output (through an IMGUR link as I cannot add photos) as well as my code, and I appreciate any guidance in the right direction!

https://i.sstatic.net/KrOMv.jpg

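import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeRegressor

dataset = pd.read_csv('insurance.csv')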

# Encode the categorical columns as integers
LE = LabelEncoder()
dataset['sex'] = LE.fit_transform(dataset['sex'])
dataset['smoker'] = LE.fit_transform(dataset['smoker'])
dataset['region'] = LE.fit_transform(dataset['region'])

print("Data Head")
print(dataset.head())
print()
print("Data Info")
dataset.info()  # info() prints its summary itself and returns None
print()



# Count the missing values in each column
for i in dataset.columns:
    print(f'Null Values in {i} :', dataset[i].isnull().sum())


X = dataset.drop('charges', axis = 1) 
y = dataset['charges'] 


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0)  

regressor = DecisionTreeRegressor(random_state=0)  # fixed seed so repeated runs give the same tree
regressor.fit(X_train, y_train)  

y_pred = regressor.predict(X_test) 

df = pd.DataFrame({'Actual Value': y_test, 'Predicted Values': y_pred})  
print(df)

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Solution

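A few things to try, if you have not already:

Use StandardScaler() from scikit-learn on the non-categorical columns/features. Note that a decision tree splits on thresholds and is insensitive to feature scaling, so this mainly matters if you also compare against linear or distance-based models.
Use GridSearchCV from scikit-learn to search for suitable hyper-parameters (for example max_depth and min_samples_leaf) instead of tuning them by hand. With the default settings the tree keeps splitting until every leaf is pure, which tends to overfit the training data. Trying a few values manually first can still give you a sense of which ranges are worth searching.
Check the documentation of DecisionTreeRegressor carefully to make sure that your usage agrees with it.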
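
As a rough illustration of the GridSearchCV suggestion, here is a minimal sketch. It does not use your insurance.csv: the data frame below is synthetic stand-in data that I made up (the columns, coefficients and noise are assumptions), and the parameter grid is only an example of values you might search, not a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the insurance data (assumption: NOT the real data set).
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n),
    "smoker": rng.integers(0, 2, n),
})
y = 250 * X["age"] + 300 * X["bmi"] + 20000 * X["smoker"] + rng.normal(0, 2000, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Example grid (hypothetical values): limit tree depth and leaf size
# so the tree cannot memorise the training set.
param_grid = {
    "max_depth": [2, 3, 5, 8, None],
    "min_samples_leaf": [1, 5, 10, 20],
}

# 5-fold cross-validation on the training set, scored by (negated) MAE.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_train, y_train)

# With the default refit=True, the best tree is refitted on the whole
# training set, so search.predict() uses that tuned model.
best_params = search.best_params_
tuned_mae = mean_absolute_error(y_test, search.predict(X_test))

print("Best parameters:", best_params)
print("Test MAE with tuned tree:", tuned_mae)
```

To apply this to your code, replace the synthetic X and y with your own X and y. Comparing the tuned test error against the error of your untuned DecisionTreeRegressor should show whether overfitting was part of the problem.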

See if this helps.