Tags: python, machine-learning, classification, random-forest, cross-validation

Random Forest further improvement


Following Jason Brownlee's tutorials, I developed my own random forest classifier code. I have pasted it below; I would like to know what further improvements I can make to increase its accuracy.

from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from matplotlib import pyplot


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.05, shuffle = True, random_state=0)

scaler = StandardScaler()
x_train = scaler.fit_transform(X_train)

x_test = scaler.transform(X_test)



# get a list of models to evaluate
def get_models():
    models = dict()
    # consider tree depths from 1 to 7 and None=full
    depths = [i for i in range(1,8)] + [None]
    for n in depths:
        models[str(n)] = RandomForestClassifier(max_depth=n)
    return models

# evaluate a model using cross-validation
def evaluate_model(model, X, y):
    # define the evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate the model and collect the results
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores


# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
    # evaluate the model
    scores = evaluate_model(model, X, y)
    # store the results
    results.append(scores)
    names.append(name)
    # summarize the performance along the way
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

The data: X is a matrix of shape (140, 20000) and y has shape (140,), with categorical labels.

I got the following results but would like to explore how to improve accuracy further.

>1 0.573 (0.107)
>2 0.650 (0.089)
>3 0.647 (0.118)
>4 0.676 (0.101)
>5 0.708 (0.103)
>6 0.698 (0.124)
>7 0.726 (0.121)
>None 0.700 (0.107)

Solution

  • Here's what stands out to me:

    • You split the data but do not use the splits.
    • You're scaling the data, but tree-based methods like random forests do not need this step.
    • You are doing your own tuning loop, instead of using sklearn.model_selection.GridSearchCV. This is fine, but it can get quite fiddly (imagine wanting to step over another hyperparameter).
    • If you use GridSearchCV you don't need to do your own cross validation.
    • You're using accuracy for evaluation, which is usually not a great evaluation metric for multi-class classification. Weighted F1 is better.
    • If you're doing cross validation, you need to put the scaler in the CV loop (e.g. using a pipeline, as in the sketch just below) because otherwise the scaler has seen the validation data... but you don't need a scaler for this learning algorithm, so this point is moot.
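
    If you did need scaling (say, for an SVM or logistic regression), a minimal sketch of keeping the scaler inside the CV loop with a Pipeline looks like this, reusing the question's X and y; it is shown only for completeness, since a random forest does not need it:

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold

    # The scaler is re-fit on each training fold only, so the validation fold
    # is never seen during scaling.
    pipe = Pipeline([
        ('scale', StandardScaler()),
        ('rf', RandomForestClassifier(random_state=1)),
    ])
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(pipe, X, y, scoring='f1_weighted', cv=cv, n_jobs=-1)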

    I would probably do something like this:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import train_test_split
    
    X, y = make_classification()
    
    # Split the data.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, shuffle=True, random_state=0)
    
    # Make things for the cross validation.
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    param_grid = {'max_depth': np.arange(3, 8)}
    model = RandomForestClassifier(random_state=1)
    
    # Create and train the cross validation.
    clf = GridSearchCV(model, param_grid,
                       scoring='f1_weighted',
                       cv=cv, verbose=3)
    
    clf.fit(X_train, y_train)
    

    Take a look at clf.cv_results_ for the scores etc, which you can plot if you want. By default GridSearchCV trains a final model on the best hyperparameters, so you can make predictions with clf.
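
    For example, a quick look at the grid results and a prediction on the held-out test set could go like this (a sketch assuming clf has been fit as above; pandas is used only to print a readable table):

    import pandas as pd

    # Mean weighted F1 per max_depth across the repeated CV folds.
    cv_results = pd.DataFrame(clf.cv_results_)
    print(cv_results[['param_max_depth', 'mean_test_score', 'std_test_score']])

    print('Best parameters:', clf.best_params_)
    print('Best CV score: %.3f' % clf.best_score_)

    # GridSearchCV refits on the full training set with the best parameters
    # (refit=True by default), so clf can be used directly for prediction.
    y_pred = clf.predict(X_test)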

    Almost forgot... you asked about improving the model :) Here are some ideas:

    • The above will help you tune more hyperparameters (e.g. max_features, n_estimators, and min_samples_leaf; see the expanded grid sketch after this list). But don't get too carried away with hyperparameter tuning.
    • You could try transforming some features (columns in X), or adding new ones.
    • Look for more data, e.g. more rows or higher-quality labels.
    • Address any issues with class imbalance (the sketch below passes class_weight='balanced' for this).
    • Try a more sophisticated algorithm, like gradient-boosted trees (there are models in sklearn, or take a look at xgboost).
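
    As a sketch of the first and fourth points, here is an expanded grid over a few more hyperparameters, with class_weight='balanced' to counter class imbalance; the parameter values are illustrative starting points, not tuned recommendations:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

    param_grid = {
        'n_estimators': [100, 300, 500],
        'max_depth': [3, 5, 7, None],
        'max_features': ['sqrt', 0.1, 0.3],
        'min_samples_leaf': [1, 2, 4],
    }

    # class_weight='balanced' reweights classes inversely to their frequency.
    model = RandomForestClassifier(class_weight='balanced', random_state=1)
    clf = GridSearchCV(model, param_grid, scoring='f1_weighted',
                       cv=cv, n_jobs=-1, verbose=1)
    clf.fit(X_train, y_train)

    Note that this grid is 108 parameter combinations times 30 CV fits each, so trim the grid (or lower n_repeats) if it is too slow on your data.

    For the last point, sklearn's HistGradientBoostingClassifier is a reasonable drop-in to try, again only as a sketch:

    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    gb = HistGradientBoostingClassifier(random_state=1)
    gb_scores = cross_val_score(gb, X_train, y_train, scoring='f1_weighted', cv=cv, n_jobs=-1)
    print('Gradient boosting: %.3f (%.3f)' % (gb_scores.mean(), gb_scores.std()))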