I'm trying to build a random forest classifier for binary classification. Can someone explain why my accuracy score varies every time I run this program? It ranges anywhere from 68% to 74%. I also tried tuning the parameters, but I can't get the accuracy above 74%, so any suggestions on that would be appreciated too. I tried GridSearchCV, but it only gave me about a 3-point increase.
#import libraries
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn import preprocessing
#read data into pandas dataframe
df = pd.read_csv("data.csv")
#handle missing values
df = df.dropna(axis = 0, how = 'any')
#handle string-type data
le = preprocessing.LabelEncoder()
le.fit(['Male','Female'])
df.loc[:,'Sex'] = le.transform(df['Sex'])
#split into train and test data
df['is_train'] = np.random.uniform(0, 1, len(df)) <= 0.8
train, test = df[df['is_train'] == True], df[df['is_train'] == False]
#make an array of columns
features = df.columns[:10]
#build the classifier
clf = RandomForestClassifier()
#train the classifier
y = train['Selector']
clf.fit(train[features], y)
#test the classifier
preds = clf.predict(test[features])
#calculate accuracy on the test set, then on the training set
print(accuracy_score(test['Selector'], preds))
print(accuracy_score(train['Selector'], clf.predict(train[features])))
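Your accuracy changes on every run because both the model and the data it is evaluated on are different each time. The model differs because you are not fixing the random state when creating it, so the bootstrap samples and the feature subsets tried at each split change from one fit to the next; have a look at the random_state parameter in the scikit-learn documentation for RandomForestClassifier. Your train/test split is also random: np.random.uniform is called without a seed, so a different set of rows ends up in the test set on every run.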
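As a rough illustration of fixing the randomness, here is a minimal sketch. It does not use your data.csv: the dataset comes from make_classification as a synthetic stand-in (an assumption, with 10 features and a binary target), the seed value 42 is arbitrary, run_once is a hypothetical helper name, and train_test_split is swapped in for your np.random.uniform mask because it accepts a random_state directly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

SEED = 42  # arbitrary choice; any fixed integer gives repeatable runs


def run_once(seed=SEED):
    # Synthetic stand-in for data.csv (assumption: 10 features, binary target).
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    # Seeded 80/20 split, so the same rows land in the test set every time.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )

    # Seeded forest, so bootstrap samples and feature subsets are repeatable.
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))


first = run_once()
second = run_once()
print(first, second)  # the two scores are identical because both seeds are fixed
```

If you would rather keep your own masking approach, calling np.random.seed with a fixed integer before the np.random.uniform line should make the split repeatable as well. Keep in mind that a fixed seed only makes the score reproducible; the 68% to 74% spread you saw is still a fair picture of how much the estimate depends on the split.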
For your second question, there are many things you can try in order to improve the accuracy of a model. In order of importance: