I want to run train_test_split from the sklearn package using the same target variable y but three different DataFrames of independent variables. I then want to fit a Random Forest classifier on each split, predict, and compute the accuracy. The goal is to get one accuracy per DataFrame so that I can compare them and select my variables accordingly. This is what I have so far, and it is not working:
df = [X1, X2, X3]  # 3 different independent variable (features) DataFrames.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier as RandomForest
from sklearn import metrics
rf_accuracy = []
for index, z in enumerate(df):
    train_X, test_X, train_y, test_y = train_test_split(z, y, train_size=0.5, test_size=0.5, random_state=2)
    rf = RandomForest(random_state=99)
    rf.fit(train_X, train_y.ravel())
    pred_y = rf.predict(test_X)
    rf_accuracy = rf_accuracy.append(metrics.accuracy_score(test_y, pred_y))
print(rf_accuracy)
When I print rf_accuracy, I should get a list with three accuracies, one for each of the three feature spaces X1, X2, X3, respectively.
For example, rf_accuracy would output [0.9765, 0.9645, 0.9212].
I guess your data look like this:
assert df.shape == (n_samples, 3) # each column for a variable/features
assert y.shape == (n_samples, )
and that you are trying to train three random forest classifiers, one on each variable/feature.
In that case, you can try this:
rf_accuracy = []
# items() yields (column name, Series) pairs; iteritems() was removed in pandas 2.0
for _, z in df.items():
    train_X, test_X, train_y, test_y = train_test_split(
        z.values.reshape(-1, 1), y, train_size=0.5, test_size=0.5, random_state=2)
    rf = RandomForest(random_state=99)
    rf.fit(train_X, train_y.ravel())
    pred_y = rf.predict(test_X)
    # append works in place and returns None, so do not reassign the list
    rf_accuracy.append(metrics.accuracy_score(test_y, pred_y))
print(rf_accuracy)
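One thing to watch in a loop like this: `list.append` modifies the list in place and returns `None`. Writing `rf_accuracy = rf_accuracy.append(...)` therefore replaces the list with `None` after the first pass, and the next pass fails when it calls `.append` on `None`. A minimal sketch of the behaviour, using made-up placeholder numbers rather than real accuracies:

```python
# list.append mutates the list in place and returns None.
accuracies = []
result = accuracies.append(0.5)  # placeholder value, not a real accuracy

# result holds None; the list itself was updated.
print(result)
print(accuracies)

# Correct pattern: call append for its side effect, without reassigning.
accuracies.append(0.75)  # another placeholder value
print(accuracies)
```

Because of this, the accumulating line in the loop should be a bare `rf_accuracy.append(...)` call.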
This worked for me on the iris dataset.
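For reference, here is a self-contained sketch of that kind of check on iris. It assumes a scikit-learn version that supports `load_iris(as_frame=True)` and that pandas is installed. The helper name `per_feature_accuracy` and its parameters are made up for illustration, and the sketch trains one classifier per iris feature column (four in total) rather than on your X1, X2, X3.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def per_feature_accuracy(df, y, split_seed=2, model_seed=99):
    """Return one test accuracy per column of df (hypothetical helper)."""
    accuracies = []
    for _, column in df.items():
        # A single column is 1-D; scikit-learn expects a 2-D feature matrix.
        X = column.to_numpy().reshape(-1, 1)
        train_X, test_X, train_y, test_y = train_test_split(
            X, y, train_size=0.5, test_size=0.5, random_state=split_seed)
        rf = RandomForestClassifier(random_state=model_seed)
        rf.fit(train_X, train_y)
        # append in place; do not reassign the list.
        accuracies.append(accuracy_score(test_y, rf.predict(test_X)))
    return accuracies


iris = load_iris(as_frame=True)
rf_accuracy = per_feature_accuracy(iris.data, iris.target.to_numpy())
print(rf_accuracy)  # one accuracy per feature column
```

If your features really are a list of separate DataFrames, the same loop body should work with `for z in [X1, X2, X3]:` and `X = z.to_numpy()`, since each DataFrame is already 2-D.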
New: my modification