My input data file has the following form:
gold, callersAtLeast1T, CalleesAtLeast1T, ...
T,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
N,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
N,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
N,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
I am trying to predict the first column (gold) from the values of the remaining columns. Here is the code I am using:
import pandas as pd
import numpy as np
dataset = pd.read_csv( 'data1extended.txt', sep= ',')
#convert T into 1 and N into 0
dataset['gold'] = dataset['gold'].astype('category').cat.codes
print(dataset.head())
row_count, column_count = dataset.shape
X = dataset.iloc[:, 1:column_count].values
y = dataset.iloc[:, 0].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))
The last three lines of my code raise an error. How can I fix it?
This line causes the error:
print(confusion_matrix(y_test,y_pred))
I printed y_test and y_pred and here is what I obtained:
y_test is: [0 0 0 ... 0 0 0]
y_pred is: [0.0007123 0.00402548 0.00402548 ... 0.00402548 0.02651928 0.00816086]
You're using RandomForestRegressor, which outputs continuous values (real numbers), whereas confusion_matrix expects discrete class labels such as 0, 1, 2 and so on. The same applies to classification_report and accuracy_score.
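To make the mismatch concrete, here is a minimal sketch that reproduces the failure without the original dataset. The arrays `y_true` and `y_scores` are made-up values for illustration only; the point is that scikit-learn's classification metrics reject a mix of binary labels and continuous predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical values, standing in for y_test and the regressor's output.
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

try:
    confusion_matrix(y_true, y_scores)
    raised = False
except ValueError as exc:
    # scikit-learn refuses to compare binary targets with continuous ones.
    raised = True
    print("ValueError:", exc)
```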
Since you're trying to predict classes, i.e. either 1 or 0, you can do one of two things:
1.) Use RandomForestClassifier instead of RandomForestRegressor. It outputs 0 or 1 directly, so its predictions can be passed to the metric functions. (Recommended)
2.) If you want to keep the real-valued output, you can apply a threshold, i.e.
y_pred = (y_pred > threshold).astype(int)
This converts each prediction to 1 if it is greater than the threshold and to 0 otherwise, and the result can then be used for the metrics.
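Both options can be sketched as follows. This is an illustration under stated assumptions, not a drop-in replacement: the data is a synthetic stand-in for data1extended.txt (random 0/1 features with a toy label), and the threshold of 0.5 is an arbitrary choice that would need tuning on real data. The comparison uses `>` so that scores above the threshold map to class 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real file: 200 rows of 0/1 features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 10))
y = X[:, 0]  # toy label (assumption): simply copies the first feature

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Option 1: a classifier predicts class labels directly.
clf = RandomForestClassifier(n_estimators=20, random_state=0)
clf.fit(X_train, y_train)
y_pred_clf = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred_clf, labels=[0, 1]))
print(accuracy_score(y_test, y_pred_clf))

# Option 2: keep the regressor and threshold its continuous output.
reg = RandomForestRegressor(n_estimators=20, random_state=0)
reg.fit(X_train, y_train)
threshold = 0.5  # assumed value, not taken from the question
y_pred_reg = (reg.predict(X_test) > threshold).astype(int)
print(confusion_matrix(y_test, y_pred_reg, labels=[0, 1]))
```

The StandardScaler step from the question is left out here because tree-based models split on feature thresholds and do not depend on feature scale; keeping it would not change the fix.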