I am newbie in Machine learning. Recently, I have learnt how to calculate confusion_matrix
for Test set
of KNN Classification
. But I do not know, how to calculate confusion_matrix
for Training set
of KNN Classification
?
How can I compute confusion_matrix
for Training set
of KNN Classification
from the following code ?
Following code is for computing confusion_matrix
for Test set
:
# Split test and train data
import numpy as np
from sklearn.model_selection import train_test_split
X = np.array(dataset.ix[:, 1:10])
y = np.array(dataset['benign_malignant'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
#Define Classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
knn.fit(X_train, y_train)
# Predicting the Test set results
y_pred = knn.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred) # Calulate Confusion matrix for test set.
For k-fold cross-validation:
I am also trying to find confusion_matrix
for Training set
using k-fold cross-validation
.
I am confused to this line knn.fit(X_train, y_train)
.
Whether I will change this line knn.fit(X_train, y_train)
?
Where should I change following code
for computing confusion_matrix
for training set
?
# Applying k-fold Method
from sklearn.cross_validation import StratifiedKFold
kfold = 10 # no. of folds (better to have this at the start of the code)
skf = StratifiedKFold(y, kfold, random_state = 0)
# Stratified KFold: This first divides the data into k folds. Then it also makes sure that the distribution of the data in each fold follows the original input distribution
# Note: in future versions of scikit.learn, this module will be fused with kfold
skfind = [None]*len(skf) # indices
cnt=0
for train_index in skf:
skfind[cnt] = train_index
cnt = cnt + 1
# skfind[i][0] -> train indices, skfind[i][1] -> test indices
# Supervised Classification with k-fold Cross Validation
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
conf_mat = np.zeros((2,2)) # Initializing the Confusion Matrix
n_neighbors = 1; # better to have this at the start of the code
# 10-fold Cross Validation
for i in range(kfold):
train_indices = skfind[i][0]
test_indices = skfind[i][1]
clf = []
clf = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
X_train = X[train_indices]
y_train = y[train_indices]
X_test = X[test_indices]
y_test = y[test_indices]
# fit Training set
clf.fit(X_train,y_train)
# predict Test data
y_predcit_test = []
y_predict_test = clf.predict(X_test) # output is labels and not indices
# Compute confusion matrix
cm = []
cm = confusion_matrix(y_test,y_predict_test)
print(cm)
# conf_mat = conf_mat + cm
You dont have to make much changes
# Predicting the train set results
y_train_pred = knn.predict(X_train)
cm_train = confusion_matrix(y_train, y_train_pred)
Here instead of using X_test
we use X_train
for classification and then we produce a classification matrix using the predicted classes for the training dataset and the actual classes.
The idea behind a classification matrix is essentially to find out the number of classifications falling into four categories(if y
is binary) -
So as long as you have two sets - predicted and actual, you can create the confusion matrix. All you got to do is predict the classes, and use the actual classes to get the confusion matrix.
EDIT
In the cross validation part, you can add a line y_predict_train = clf.predict(X_train)
to calculate the confusion matrix for each iteration. You can do this because in the loop, you initialize the clf
everytime which basically means reseting your model.
Also, in your code you are finding the confusion matrix each time but you are not storing it anywhere. At the end you'll be left with a cm of just the last test set.