Tags: python, machine-learning, scikit-learn, random-forest, kaggle

Why can't I find the lowest mean absolute error using Random Forest?


I am doing a Kaggle competition with the following dataset: https://www.kaggle.com/c/home-data-for-ml-course/download/train.csv

According to the theory, increasing the number of estimators in a Random Forest model reduces the mean absolute error only up to some point (the sweet spot), and further increases cause overfitting. Plotting the number of estimators against the mean absolute error should therefore give the red curve below, where the lowest point marks the best number of estimators.

[Image: theoretical curve of error vs. number of trees, with a minimum at the sweet spot]

I try to find the best number of estimators with the following code, but the plot shows that the MAE keeps decreasing. What am I doing wrong?

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

train_data = pd.read_csv('train.csv')
y = train_data['SalePrice']
# for simplicity, drop all columns with missing values and all non-numeric columns
X = train_data.drop('SalePrice', axis=1).dropna(axis=1).select_dtypes(['number'])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
mae_list = []
for n_estimators in range(10, 800, 10):
    rf_model = RandomForestRegressor(n_estimators=n_estimators, random_state=0, n_jobs=8)
    rf_model.fit(X_train, y_train)
    preds = rf_model.predict(X_test)
    mae = mean_absolute_error(y_test, preds)
    mae_list.append({'n_est': n_estimators, 'mae': mae})

# plot MAE against the number of estimators
plt.plot([item['n_est'] for item in mae_list], [item['mae'] for item in mae_list])
plt.show()

[Image: resulting plot, with the MAE decreasing steadily over the whole 10-800 estimator range]


Solution

  • You are not necessarily doing anything wrong.

    Looking more closely at the theoretical curves you show, you'll notice that the horizontal axis gives not the slightest indication of the actual number of trees/iterations at which such a minimum should happen. And this is a rather general feature of such theoretical predictions: they tell you that something is expected, but nothing about where exactly (or even roughly) you should expect it.

    Keeping this in mind, the only thing I can conclude from your second plot is that, in the specific range of ~ 800 trees you have tried, you are actually still to the left of the expected minimum.

    Again, there is no theoretical prediction of how many trees (800, or 8,000, or more) you need to add before reaching that minimum; the only way to find out is to keep extending the search, as in the first sketch below.

    To bring some empirical corroboration into the discussion: in my own first Kaggle competition, we kept adding trees until we reached ~ 24,000 of them before our validation error started diverging (we were using GBM rather than RF, but the rationale is identical). A way to monitor such a validation curve is shown in the second sketch below.
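
    To extend the search without refitting a forest from scratch at every step, scikit-learn's warm_start option can reuse the trees already grown and only add new ones on each call to fit. This is a minimal sketch, not a definitive recipe: it reuses X_train, X_test, y_train, y_test from the question's code, and the 4,000-tree ceiling and step of 100 are arbitrary choices.

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error

    # warm_start=True: each fit() call keeps the already-grown trees and
    # adds only the extra ones implied by the new n_estimators value
    rf_model = RandomForestRegressor(warm_start=True, random_state=0, n_jobs=8)
    mae_list = []
    for n_estimators in range(100, 4001, 100):  # arbitrary upper bound for illustration
        rf_model.set_params(n_estimators=n_estimators)
        rf_model.fit(X_train, y_train)
        mae_list.append({'n_est': n_estimators,
                         'mae': mean_absolute_error(y_test, rf_model.predict(X_test))})

    For the GBM case, scikit-learn's GradientBoostingRegressor exposes staged_predict, which yields the model's predictions after every boosting iteration, so the entire validation curve comes from a single fit. Again a sketch under the same assumptions (the competition above does not say which GBM implementation was used, and the 5,000 iterations are an arbitrary illustration):

    from sklearn.ensemble import GradientBoostingRegressor
    import matplotlib.pyplot as plt

    gbm = GradientBoostingRegressor(n_estimators=5000, random_state=0)
    gbm.fit(X_train, y_train)
    # staged_predict yields test-set predictions after each boosting stage
    stage_mae = [mean_absolute_error(y_test, preds)
                 for preds in gbm.staged_predict(X_test)]
    plt.plot(range(1, len(stage_mae) + 1), stage_mae)  # look for where the curve turns upward
    plt.show()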