Tags: python, machine-learning, classification, random-forest, cross-validation

Random Forest further improvement


Following Jason Brownlee's tutorials, I developed my own random forest classifier code. I have pasted it below; I would like to know what further improvements I can make to increase its accuracy.

from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from matplotlib import pyplot


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.05, shuffle = True, random_state=0)

scaler = StandardScaler()
x_train = scaler.fit_transform(X_train)

x_test = scaler.transform(X_test)



# get a list of models to evaluate
def get_models():
    models = dict()
    # consider tree depths from 1 to 7 and None=full
    depths = [i for i in range(1,8)] + [None]
    for n in depths:
        models[str(n)] = RandomForestClassifier(max_depth=n)
    return models

# evaluate a model using cross-validation
def evaluate_model(model, X, y):
    # define the evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate the model and collect the results
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores


# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    # evaluate the model
    scores = evaluate_model(model, X, y)
    # store the results
    results.append(scores)
    names.append(name)
    # summarize the performance along the way
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

The data: X is a matrix of shape (140, 20000) and y has shape (140,), with categorical labels.

I got the following results but would like to explore how to improve accuracy further.

>1 0.573 (0.107)
>2 0.650 (0.089)
>3 0.647 (0.118)
>4 0.676 (0.101)
>5 0.708 (0.103)
>6 0.698 (0.124)
>7 0.726 (0.121)
>None 0.700 (0.107)

Solution

  • Here's what stands out to me:

    • You split the data but do not use the splits.
    • You're scaling the data, but tree-based methods like random forests do not need this step.
    • You are doing your own tuning loop, instead of using sklearn.model_selection.GridSearchCV. This is fine, but it can get quite fiddly (imagine wanting to step over another hyperparameter).
    • If you use GridSearchCV you don't need to do your own cross validation.
    • You're using accuracy for evaluation, which is usually not a great evaluation metric for multi-class classification. Weighted F1 is better.
    • If you're doing cross validation, you need to put the scaler in the CV loop (e.g. using a pipeline, as in the sketch just below) because otherwise the scaler has seen the validation data... but you don't need a scaler for this learning algorithm, so this point is moot.
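
    If you did need scaling (say, for an SVM or logistic regression), a minimal sketch of keeping the scaler inside the CV loop with a Pipeline looks like this, reusing the question's X and y; it is shown only for completeness, since a random forest does not need it:

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold

    # The scaler is re-fit on each training fold only, so the validation fold
    # is never seen during scaling.
    pipe = Pipeline([
        ('scale', StandardScaler()),
        ('rf', RandomForestClassifier(random_state=1)),
    ])
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(pipe, X, y, scoring='f1_weighted', cv=cv, n_jobs=-1)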

    I would probably do something like this:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import train_test_split
    
    X, y = make_classification()
    
    # Split the data.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, shuffle=True, random_state=0)
    
    # Make things for the cross validation.
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    param_grid = {'max_depth': np.arange(3, 8)}
    model = RandomForestClassifier(random_state=1)
    
    # Create and train the cross validation.
    clf = GridSearchCV(model, param_grid,
                       scoring='f1_weighted',
                       cv=cv, verbose=3)
    
    clf.fit(X_train, y_train)
    

    Take a look at clf.cv_results_ for the scores etc, which you can plot if you want. By default GridSearchCV trains a final model on the best hyperparameters, so you can make predictions with clf.
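
    For example, a quick look at the grid results and a prediction on the held-out test set could go like this (a sketch assuming clf has been fit as above; pandas is used only to print a readable table):

    import pandas as pd

    # Mean weighted F1 per max_depth across the repeated CV folds.
    cv_results = pd.DataFrame(clf.cv_results_)
    print(cv_results[['param_max_depth', 'mean_test_score', 'std_test_score']])

    print('Best parameters:', clf.best_params_)
    print('Best CV score: %.3f' % clf.best_score_)

    # GridSearchCV refits on the full training set with the best parameters
    # (refit=True by default), so clf can be used directly for prediction.
    y_pred = clf.predict(X_test)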

    Almost forgot... you asked about improving the model :) Here are some ideas:

    • The above will help you tune more hyperparameters (e.g. max_features, n_estimators, and min_samples_leaf; see the expanded grid sketch after this list). But don't get too carried away with hyperparameter tuning.
    • You could try transforming some features (columns in X), or adding new ones.
    • Look for more data, e.g. more rows or higher-quality labels.
    • Address any issues with class imbalance (the sketch below passes class_weight='balanced' for this).
    • Try a more sophisticated algorithm, like gradient-boosted trees (there are models in sklearn, or take a look at xgboost).
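
    As a sketch of the first and fourth points, here is an expanded grid over a few more hyperparameters, with class_weight='balanced' to counter class imbalance; the parameter values are illustrative starting points, not tuned recommendations:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

    param_grid = {
        'n_estimators': [100, 300, 500],
        'max_depth': [3, 5, 7, None],
        'max_features': ['sqrt', 0.1, 0.3],
        'min_samples_leaf': [1, 2, 4],
    }

    # class_weight='balanced' reweights classes inversely to their frequency.
    model = RandomForestClassifier(class_weight='balanced', random_state=1)
    clf = GridSearchCV(model, param_grid, scoring='f1_weighted',
                       cv=cv, n_jobs=-1, verbose=1)
    clf.fit(X_train, y_train)

    Note that this grid is 108 parameter combinations times 30 CV fits each, so trim the grid (or lower n_repeats) if it is too slow on your data.

    For the last point, sklearn's HistGradientBoostingClassifier is a reasonable drop-in to try, again only as a sketch:

    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    gb = HistGradientBoostingClassifier(random_state=1)
    gb_scores = cross_val_score(gb, X_train, y_train, scoring='f1_weighted', cv=cv, n_jobs=-1)
    print('Gradient boosting: %.3f (%.3f)' % (gb_scores.mean(), gb_scores.std()))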