Tags: python, performance, machine-learning, scikit-learn, knn

How to train up a (k-NN) model on additional data (for the sake of plotting a learning curve)


I am playing with the MNIST database, for which I would like to plot learning curves for various learning algorithms. For the sake of this question, let us consider the k-NN algorithm.

I imported the data using the mnist package and transformed it into numpy.ndarray objects.

import numpy as np
from mnist import MNIST

# Load the raw MNIST data from ./data
mndata = MNIST('./data')

images_train, labels_train = mndata.load_training()
images_test, labels_test = mndata.load_testing()

# Labels come back as array.array; convert to plain lists first
labels_train = labels_train.tolist()
labels_test = labels_test.tolist()

X_train = np.array(images_train)
y_train = np.array(labels_train)
X_test = np.array(images_test)
y_test = np.array(labels_test)

However, the training set contains 60,000 examples, which is too much for my computer. I want to plot a learning curve to find out whether further training makes any sense.

import time

import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

start_time = time.time()

training_range = range(500, 1500, 100)
test_size = 1000

# The test subset never changes, so slice it once outside the loop
X_test_small = X_test[:test_size]
y_test_small = y_test[:test_size]

training_accuracy = []
test_accuracy = []

for train_size in training_range:
    X_train_small = X_train[:train_size]
    y_train_small = y_train[:train_size]

    # Refit from scratch for each training-set size
    clf = KNeighborsClassifier(n_neighbors=3)
    clf.fit(X_train_small, y_train_small)
    training_accuracy.append(clf.score(X_train_small, y_train_small))
    test_accuracy.append(clf.score(X_test_small, y_test_small))

print(f"Elapsed: {time.time() - start_time:.1f} s")

plt.plot(training_range, training_accuracy, label="training accuracy")
plt.plot(training_range, test_accuracy, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("Training size")
plt.title("Learning curve")
plt.legend()
plt.show()

Output:

[learning-curve plot: training and test accuracy versus training size; image omitted]

It takes over a minute to produce this simple graph, which at best shows the accuracy of a model trained on just 1500 examples.

The main problem is that the program calls clf.fit(X_train_small, y_train_small) repeatedly, recalculating everything from scratch each time.
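(For reference, scikit-learn's built-in learning_curve helper can at least run these repeated fits in parallel; it does not make them incremental. A minimal sketch, assuming the X_train/y_train arrays from above; the subset size, train_sizes, and cv values are illustrative:)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

# Run one fit per training size, spread across all CPU cores
sizes, train_scores, test_scores = learning_curve(
    KNeighborsClassifier(n_neighbors=3),
    X_train[:3000], y_train[:3000],          # small subset to keep it fast
    train_sizes=np.arange(500, 1500, 100),   # absolute training-set sizes
    cv=3,
    n_jobs=-1,
)

# Scores come back with one column per CV fold; average them for plotting
plt.plot(sizes, train_scores.mean(axis=1), label="training accuracy")
plt.plot(sizes, test_scores.mean(axis=1), label="test accuracy")
plt.legend()
plt.show()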

Question: Is there a way to preserve what has already been learned and just "train up" on the new data?

I guess the answer is no for an arbitrary algorithm, but k-NN works in such a way that, in principle, it should be possible (that is just my opinion).
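Indeed, the fitted k-NN "model" is essentially just the stored training data, so "training up" can be emulated by appending the new rows and refitting: scikit-learn has no API to append to a fitted KNeighborsClassifier, but the refit itself is cheap compared to prediction. A sketch, reusing the arrays above (the 500/100 split is arbitrary):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train[:500], y_train[:500])        # initial batch

# ... later, 100 more examples arrive: append and refit.
# The expensive part of k-NN is predict/score, not this refit.
X_grown = np.vstack([X_train[:500], X_train[500:600]])
y_grown = np.concatenate([y_train[:500], y_train[500:600]])
clf.fit(X_grown, y_grown)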


Solution

  • As Vivek Kumar says, only estimators that implement the partial_fit() method can do what you want, such as linear_model.Perceptron, linear_model.SGDClassifier, etc. (see the sketch below).

    Why does KNN not have a partial_fit()? Because KNN spends no effort in the training phase; it is a lazy algorithm. All the effort is spent during the testing phase, where it needs the complete training set to decide. Since it cannot decide without the complete training set, giving it the training data one piece at a time is meaningless.
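For illustration, here is a minimal sketch of what partial_fit-based incremental training looks like with SGDClassifier on the same data (the chunk size is arbitrary; note that all class labels must be declared on the first call):

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(random_state=0)
classes = np.unique(y_train)                 # partial_fit needs all classes up front

test_accuracy = []
for start in range(0, 1500, 100):
    X_chunk = X_train[start:start + 100]
    y_chunk = y_train[start:start + 100]
    clf.partial_fit(X_chunk, y_chunk, classes=classes)   # updates the model in place
    test_accuracy.append(clf.score(X_test[:1000], y_test[:1000]))

Each call to partial_fit() updates the existing weights instead of retraining from scratch, which is exactly what KNN cannot offer.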