python, machine-learning, statistics, regression, xgboost

Why are the RMSE and MSE so large when using XGBoost?


I'm learning XGBoost, and the MAE and RMSE numbers are so large. How is that possible?

This is the code I'm using in Python:

import xgboost as xgb

# X (feature matrix) and y (target values) are assumed to be loaded already

# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
# ("reg:squarederror" is the current name of the deprecated "reg:linear" objective)
params = {"objective": "reg:squarederror", "max_depth": 4}

# Perform 4-fold cross-validation over 5 boosting rounds: cv_results
cv_results = xgb.cv(
    dtrain=housing_dmatrix,
    params=params,
    nfold=4,
    num_boost_round=5,
    metrics="rmse",
    as_pandas=True,
    seed=123,
)

# Print cv_results
print(cv_results)

# Extract and print final boosting round metric
print(cv_results["test-rmse-mean"].tail(1))


    train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
0    141767.535156      429.452682   142980.429688    1193.794436
1    102832.542969      322.473304   104891.392578    1223.157623
2     75872.617187      266.469946    79478.935547    1601.344218
3     57245.651367      273.625016    62411.921875    2220.149857
4     44401.297851      316.422372    51348.281250    2963.378741
    51348.28125

Solution

  • I think the problem is in how you interpret the metrics. First, here is what they stand for:

    • MSE stands for mean squared error, and
    • RMSE stands for root mean squared error.

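    This means that both metrics depend on the scale of the value you predict: RMSE is expressed in the same units as the target, and MSE in those units squared. If you were predicting the number of seats in a car, which varies between 2 and 7, an RMSE of about 51,000 would be enormous. If you are predicting something that varies between 1 and 100 million, the same RMSE is quite small. That is the main reason to also look at a scale-independent metric such as MAPE (mean absolute percentage error), which expresses the error as a fraction of the true value: 0 is a perfect fit, and 0.1 means the predictions are off by 10% on average. Note that MAPE is not capped at 1; it exceeds 1 when predictions are off by more than 100% on average.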
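To make the scale dependence concrete, here is a minimal sketch in plain Python. The two target lists and the helper functions rmse and mape are made up for illustration and are not part of the question's data. Both sets of predictions are 10% too high, yet the RMSE differs by a factor of 100,000 because the targets do.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (0.1 means 10%)."""
    relative_errors = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return sum(relative_errors) / len(y_true)

# Hypothetical targets on two very different scales
seats_true = [2, 4, 5, 7]
prices_true = [200_000, 400_000, 500_000, 700_000]

# Predict 10% too high everywhere
seats_pred = [t * 1.1 for t in seats_true]
prices_pred = [t * 1.1 for t in prices_true]

print("RMSE seats: ", rmse(seats_true, seats_pred))    # roughly 0.48
print("RMSE prices:", rmse(prices_true, prices_pred))  # roughly 48,000
print("MAPE seats: ", mape(seats_true, seats_pred))    # roughly 0.1
print("MAPE prices:", mape(prices_true, prices_pred))  # roughly 0.1
```

An RMSE in the tens of thousands is therefore not alarming on its own; it has to be compared with the typical size of the target.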

    For more information about MAPE and how to compute it with scikit-learn, see the documentation for sklearn.metrics.mean_absolute_percentage_error.
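As a rough sketch of the scikit-learn route, the snippet below computes both metrics on a small set of invented values. It assumes scikit-learn 0.24 or newer, where mean_absolute_percentage_error is available. The numbers are illustrative only and are not taken from the question's housing data.

```python
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error

# Hypothetical true values and predictions on a house-price-like scale
y_true = [200_000, 400_000, 500_000, 700_000]
y_pred = [220_000, 380_000, 550_000, 630_000]

# RMSE is the square root of MSE and is in the same units as y_true
rmse = mean_squared_error(y_true, y_pred) ** 0.5

# MAPE is returned as a fraction, not multiplied by 100
mape = mean_absolute_percentage_error(y_true, y_pred)

print("RMSE:", rmse)  # roughly 45,277
print("MAPE:", mape)  # roughly 0.0875, i.e. about 8.75% off on average

# MAPE has no upper bound of 1: predicting 5 when the truth is 1 gives 4.0
mape_unbounded = mean_absolute_percentage_error([1.0], [5.0])
print("MAPE for a 400% miss:", mape_unbounded)
```

Here the RMSE looks large in absolute terms while the MAPE shows the predictions are within about 9% of the true values on average.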