Following Jason Brownlee's tutorials, I developed my own Random forest classifier code. I paste it below, I would like to know what further improvements can I do to improve the accuracy to my code
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from matplotlib import pyplot
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.05, shuffle = True, random_state=0)
scaler = StandardScaler()
x_train = scaler.fit_transform(X_train)
x_test = scaler.transform(X_test)
# get a list of models to evaluate
def get_models():
models = dict()
# consider tree depths from 1 to 7 and None=full
depths = [i for i in range(1,8)] + [None]
for n in depths:
models[str(n)] = RandomForestClassifier(max_depth=n)
return models
# evaluate model using cross-validation
def evaluate_model(model, X, y):
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model and collect the results
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
return scores
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
# evaluate the model
scores = evaluate_model(model, X, y)
# store the results
results.append(scores)
names.append(name)
# summarize the performance along the way
print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
The data, X is a matrix of (140,20000)
and y is (140,)
categorical.
I got the following results but would like to explore how to improve accuracy further.
>1 0.573 (0.107)
>2 0.650 (0.089)
>3 0.647 (0.118)
>4 0.676 (0.101)
>5 0.708 (0.103)
>6 0.698 (0.124)
>7 0.726 (0.121)
>None 0.700 (0.107)
Here's what stands out to me:
sklearn.model_selection.GridSearchCV
. This is fine, but it can get quite fiddly (imagine wanting to step over another hyperparameter).GridSearchCV
you don't need to do your own cross validation.I would probably do something like this:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
X, y = make_classification()
# Split the data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, shuffle=True, random_state=0)
# Make things for the cross validation.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
param_grid = {'max_depth': np.arange(3, 8)}
model = RandomForestClassifier(random_state=1)
# Create and train the cross validation.
clf = GridSearchCV(model, param_grid,
scoring='f1_weighted',
cv=cv, verbose=3)
clf.fit(X_train, y_train)
Take a look at clf.cv_results_
for the scores etc, which you can plot if you want. By default GridSearchCV
trains a final model on the best hyperparameters, so you can make predictions with clf
.
Almost forgot... you asked about improving the model :) Here are some ideas:
max_features
, n_estimators
, and min_samples_leaf
). But don't get too carried away with hyperparameter tuning.X
), or adding new ones.sklearn
, or take a look at xgboost
).