I'm learning XGBoost, and the MAE and RMSE numbers are so large. How is that possible?
This is the code I'm using in Python:
import xgboost as xgb

# X (features) and y (target) are assumed to be loaded already
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)
# Create the parameter dictionary: params
# ("reg:squarederror" is the current name of the deprecated "reg:linear" objective)
params = {"objective": "reg:squarederror", "max_depth": 4}
# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=4, num_boost_round=5, metrics="rmse", as_pandas=True, seed=123)
# Print cv_results
print(cv_results)
# Extract and print final boosting round metric
print((cv_results["test-rmse-mean"]).tail(1))
Output:
   train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
0    141767.535156      429.452682   142980.429688    1193.794436
1    102832.542969      322.473304   104891.392578    1223.157623
2     75872.617187      266.469946    79478.935547    1601.344218
3     57245.651367      273.625016    62411.921875    2220.149857
4     44401.297851      316.422372    51348.281250    2963.378741
51348.28125
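I think your problem is in how you interpret the metrics. First I'll explain what they stand for: MAE is the mean absolute error, the average of the absolute differences between the predicted and actual values, and RMSE is the root mean squared error, the square root of the average squared difference.
This means that both metrics are expressed in the same units as the target, so their size depends on the scale of the value you predict. If you were predicting the number of seats in a car, which varies between 2 and 7, an RMSE of 51348 would be really large. On the other hand, if you were predicting something that varies between 1 and 100 million, the same RMSE would be really low. That is the main reason to also look at a relative metric such as MAPE (Mean Absolute Percentage Error), which expresses the error as a fraction of the actual value. A MAPE of 0.1 means the predictions are off by 10% on average; it usually falls between 0 and 1, although it can exceed 1 and is undefined when an actual value is 0.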
Check out this link for more information about MAPE and how to compute it with scikit-learn.
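To make the scale dependence concrete, here is a minimal sketch using NumPy only. The prices are made-up illustrative values, not your housing data, and the helper names rmse, mae and mape are just names chosen for this example. It scores the same predictions twice, once in dollars and once in thousands of dollars:

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean squared error: same units as the target
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    # Mean absolute error: same units as the target
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true, y_pred):
    # Mean absolute percentage error as a fraction;
    # undefined if any value in y_true is 0
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

# Made-up house prices in dollars (illustrative assumption)
prices = np.array([150_000.0, 200_000.0, 250_000.0, 400_000.0])
preds = prices * 1.10  # every prediction is 10% too high

# The same data expressed in thousands of dollars
prices_k = prices / 1000
preds_k = preds / 1000

print("RMSE in dollars:  ", rmse(prices, preds))
print("RMSE in thousands:", rmse(prices_k, preds_k))
print("MAE in dollars:   ", mae(prices, preds))
print("MAPE in dollars:  ", mape(prices, preds))
print("MAPE in thousands:", mape(prices_k, preds_k))
```

RMSE and MAE shrink by a factor of 1000 when the unit changes, while MAPE stays the same. For your results, a quick sanity check is to compare the final test RMSE with the typical size of y (for example y.mean()): if y is a house price in dollars, as the variable name housing_dmatrix suggests, an RMSE of about 51348 may be a moderate relative error rather than a huge one. Recent scikit-learn versions also provide sklearn.metrics.mean_absolute_percentage_error if you would rather not write the helper yourself.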