I'm trying to use an online (out-of-core) learning algorithm for the MNIST problem using SGDClassifier, but it seems that the accuracy is not always increasing.
What should I do in this case? Somehow save the classifier with the best accuracy? Is SGDClassifier converging to some optimal solution?
Here is my code:
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import fetch_mldata
from sklearn.utils import shuffle

# use all digits
mnist = fetch_mldata("MNIST original")
X_train, y_train = mnist.data[:70000] / 255., mnist.target[:70000]
X_train, y_train = shuffle(X_train, y_train)
X_test, y_test = X_train[60000:70000], y_train[60000:70000]

step = 1000
batches = np.arange(0, 60000, step)
all_classes = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
classifier = SGDClassifier()
for curr in batches:
    X_curr, y_curr = X_train[curr:curr + step], y_train[curr:curr + step]
    classifier.partial_fit(X_curr, y_curr, classes=all_classes)
    score = classifier.score(X_test, y_test)
    print(score)
print("all done")
I tested LinearSVC vs. SGDClassifier on MNIST using 10k samples for training and 10k for testing, and got an accuracy of 0.883 in 13.95 s for LinearSVC versus 0.85 in 1.32 s for SGD, so SGD is faster but less accurate.
# test LinearSVC vs SGDClassifier
import time
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier

t0 = time.time()
clf = LinearSVC()
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(score)
print(time.time() - t0)

t1 = time.time()
clf = SGDClassifier()
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(score)
print(time.time() - t1)
I also found some related info here: https://stats.stackexchange.com/a/14936/16843
UPDATE: making more than one pass (10 passes) through the data achieved the best accuracy, 90.8%, so that can be the solution. Another peculiarity of SGD is that the data must be shuffled before being passed to the classifier.
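The multi-pass-with-shuffling approach from the update can be sketched as below. This is a minimal sketch, not the original code: it uses load_digits as a small stand-in for MNIST (so it runs quickly), and the batch size, epoch count, and random seeds are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

# load_digits is a small stand-in for MNIST here
digits = load_digits()
X, y = digits.data / 16.0, digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

all_classes = np.arange(10)
clf = SGDClassifier(random_state=0)
step = 200       # mini-batch size (arbitrary)
n_epochs = 10    # number of passes over the data

for epoch in range(n_epochs):
    # reshuffle before every pass so SGD never sees the same ordering twice
    X_train, y_train = shuffle(X_train, y_train, random_state=epoch)
    for start in range(0, len(X_train), step):
        clf.partial_fit(X_train[start:start + step],
                        y_train[start:start + step],
                        classes=all_classes)
    print("epoch %d: %.3f" % (epoch, clf.score(X_test, y_test)))
```

The per-epoch scores still fluctuate a little, but the trend over 10 passes is clearly upward compared to a single pass.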
First remark: you are using SGDClassifier with the default parameters: they are likely not the optimal values for this dataset, so try other values as well (especially for alpha, the regularization parameter).

Now to answer your question: it's quite unlikely that a linear model will do very well on a dataset like MNIST, which is a digit image classification task. You might want to try non-linear models such as:
- SVC(kernel='rbf'), which is not scalable (try it on a small subset of the training set) and not incremental / out-of-core;
- ExtraTreesClassifier(n_estimators=100) or more, which is not out-of-core either; the larger the number of sub-estimators, the longer it will take to train.

You can also try the Nystroem approximation of SVC(kernel='rbf') by transforming the dataset using Nystroem(n_components=1000, gamma=0.05) fitted on a small subset of the data (e.g. 10000 samples) and then passing the whole transformed training set to a linear model such as SGDClassifier: it requires 2 passes over the dataset.
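The two-pass Nystroem recipe could look like the sketch below. Again this uses load_digits as a stand-in for MNIST, and n_components=300, the 500-sample subset, and gamma=0.05 are scaled-down assumptions rather than tuned values.

```python
from sklearn.datasets import load_digits
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data / 16.0, digits.target, random_state=0)

# pass 1: fit the approximate RBF feature map on a small subset
feature_map = Nystroem(n_components=300, gamma=0.05, random_state=0)
feature_map.fit(X_train[:500])

# pass 2: train a linear model on the transformed training set
clf = SGDClassifier(random_state=0)
clf.fit(feature_map.transform(X_train), y_train)
print("%.3f" % clf.score(feature_map.transform(X_test), y_test))
```

Since only the linear model sees the full dataset, the second pass can still be done out-of-core with partial_fit on transformed mini-batches.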
There is also a pull request for a 1-hidden-layer perceptron on GitHub that should be both faster to compute than ExtraTreesClassifier and approach 98% test set accuracy on MNIST (and also provide a partial_fit API for out-of-core learning).
Edit: the fluctuation of the SGDClassifier score estimate is expected: SGD stands for stochastic gradient descent, which means that examples are considered one at a time. Badly classified samples can cause an update of the model weights that is detrimental for other samples, so you need to do more than one pass over the data to let the learning rate decrease enough to get a smoother estimate of the validation accuracy. You can use itertools.repeat in your for loop to do several passes (e.g. 10) over your dataset.
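The itertools.repeat idiom keeps the original single-loop structure while doing several passes: repeating the same (X, y) pair n times yields one pass per repetition. A minimal sketch, again on load_digits as a stand-in for MNIST, with arbitrary batch size and pass count:

```python
import itertools
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data / 16.0, digits.target, random_state=0)

clf = SGDClassifier(random_state=0)
# itertools.repeat yields the same (X, y) pair 10 times: 10 passes over the data
for X_pass, y_pass in itertools.repeat((X_train, y_train), 10):
    for start in range(0, len(X_pass), 200):
        clf.partial_fit(X_pass[start:start + 200],
                        y_pass[start:start + 200],
                        classes=np.arange(10))
print("%.3f" % clf.score(X_test, y_test))
```

Note that repeat reuses the same ordering on every pass; combining it with a reshuffle per pass, as in the update above, is usually better for SGD.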