
Running train-test split and obtaining model accuracies for different datasets


I want to run train_test_split from the sklearn package using the same target variable y but three different DataFrames of independent variables. Then I want to fit a Random Forest classifier on each, predict, and get the accuracy. The goal is to obtain one accuracy per DataFrame so that I can compare them and select my variables accordingly. I have the following so far, which is not working.

df = [X1, X2, X3]   # 3 different independent variable (features) DataFrames. 

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier as RandomForest
from sklearn import metrics
rf_accuracy = []

for index, z in enumerate(df):
  train_X, test_X, train_y, test_y = train_test_split(z, y,train_size=0.5,test_size=0.5, random_state=2)
  rf = RandomForest(random_state=99)
  rf.fit(train_X, train_y.ravel())
  pred_y = rf.predict(test_X)
  rf_accuracy = rf_accuracy.append(metrics.accuracy_score(test_y, pred_y))

print(rf_accuracy)

When I print the rf_accuracy, I should get a list with three accuracies from using three different feature spaces X1, X2, X3, respectively.

For example, rf_accuracy will output [0.9765, 0.9645, 0.9212]


Solution

  • I assume your data look like this:

    assert df.shape == (n_samples, 3)   # each column for a variable/features
    assert y.shape == (n_samples, )
    

    and that you are trying to train three Random Forest classifiers, one on each variable/feature.

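    Note that list.append modifies the list in place and returns None, so assigning its result back to rf_accuracy replaces the list with None and the loop fails on the second iteration. Call append without the assignment:

    rf_accuracy = []
    for _, z in df.items():   # iteritems() was removed in pandas 2.0
        # reshape each column to (n_samples, 1), the 2-D input sklearn expects
        train_X, test_X, train_y, test_y = train_test_split(
            z.values.reshape(-1, 1), y, train_size=0.5, test_size=0.5, random_state=2)
        rf = RandomForest(random_state=99)
        rf.fit(train_X, train_y.ravel())
        pred_y = rf.predict(test_X)
        rf_accuracy.append(metrics.accuracy_score(test_y, pred_y))

    print(rf_accuracy)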
    

    This worked for me on the iris dataset.
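If X1, X2 and X3 are instead separate multi-column DataFrames kept in a plain Python list, as in the question, the same loop can run over the list directly. The following is only a minimal sketch under assumptions: the iris data stands in for the real features and target, the three column subsets are hypothetical stand-ins for X1, X2, X3, and the helper name compare_feature_sets is made up for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative data only: iris stands in for the real X1, X2, X3 and y.
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

# Hypothetical feature sets: sepal columns, petal columns, all columns.
feature_sets = [X.iloc[:, :2], X.iloc[:, 2:], X]


def compare_feature_sets(feature_sets, y):
    """Return one test accuracy per feature DataFrame, in input order."""
    accuracies = []
    for features in feature_sets:
        # Same split settings as in the question, so every feature set
        # is evaluated on the same rows.
        train_X, test_X, train_y, test_y = train_test_split(
            features, y, train_size=0.5, test_size=0.5, random_state=2)
        rf = RandomForestClassifier(random_state=99)
        rf.fit(train_X, train_y)
        # append() mutates the list in place; do not reassign its result.
        accuracies.append(accuracy_score(test_y, rf.predict(test_X)))
    return accuracies


rf_accuracy = compare_feature_sets(feature_sets, y)
print(rf_accuracy)  # a list of three accuracies, one per feature set
```

Because random_state is fixed in train_test_split, each feature set is scored on the same test rows, which keeps the three accuracies comparable. A single 50/50 split can still be noisy, so cross-validation may be worth considering before choosing variables.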

    New: my modification