I'm trying to use an online (out-of-core) learning algorithm for the MNIST problem using SGDClassifier, but it seems that the accuracy is not always increasing.
What should I do in this case? Somehow save the classifier with the best accuracy? Is SGDClassifier converging to some optimal solution?
Here is my code:
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import fetch_mldata
from sklearn.utils import shuffle

# use all digits
mnist = fetch_mldata("MNIST original")
X_train, y_train = mnist.data[:70000] / 255., mnist.target[:70000]
X_train, y_train = shuffle(X_train, y_train)
X_test, y_test = X_train[60000:70000], y_train[60000:70000]

step = 1000
batches = np.arange(0, 60000, step)
all_classes = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
classifier = SGDClassifier()
for curr in batches:
    X_curr, y_curr = X_train[curr:curr + step], y_train[curr:curr + step]
    classifier.partial_fit(X_curr, y_curr, classes=all_classes)
    score = classifier.score(X_test, y_test)
    print(score)
print("all done")
I tested LinearSVC vs. SGDClassifier on MNIST using 10k samples for training and 10k for testing, and got an accuracy of 0.883 in 13.95 s for LinearSVC versus 0.85 in 1.32 s for SGD, so SGD is faster but less accurate.
# test LinearSVC vs SGDClassifier
import time
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier

t0 = time.time()
clf = LinearSVC()
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(score)
print(time.time() - t0)

t1 = time.time()
clf = SGDClassifier()
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(score)
print(time.time() - t1)
I also found some related info here: https://stats.stackexchange.com/a/14936/16843
UPDATE: making more than one pass (10 passes) through the data achieved the best accuracy, 90.8%, so that can be the solution. Another peculiarity of SGD is that the data must be shuffled before being passed to the classifier.
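The multi-pass-with-shuffling approach from the update can be sketched as below. This is a minimal sketch, not the original code: it uses load_digits as a small stand-in for MNIST (so it runs quickly), and the batch size, epoch count, and random seeds are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

# load_digits is a small stand-in for MNIST here
digits = load_digits()
X, y = digits.data / 16.0, digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

all_classes = np.arange(10)
clf = SGDClassifier(random_state=0)
step = 200       # mini-batch size (arbitrary)
n_epochs = 10    # number of passes over the data

for epoch in range(n_epochs):
    # reshuffle before every pass so SGD never sees the same ordering twice
    X_train, y_train = shuffle(X_train, y_train, random_state=epoch)
    for start in range(0, len(X_train), step):
        clf.partial_fit(X_train[start:start + step],
                        y_train[start:start + step],
                        classes=all_classes)
    print("epoch %d: %.3f" % (epoch, clf.score(X_test, y_test)))
```

The per-epoch scores still fluctuate a little, but the trend over 10 passes is clearly upward compared to a single pass.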
First remark: you are using SGDClassifier with the default parameters: they are likely not the optimal values for this dataset, so try other values as well (especially for alpha, the regularization parameter).

Now to answer your question: it's quite unlikely that a linear model will do very well on a dataset like MNIST, which is a digit image classification task. You might want to try non-linear models such as:
- SVC(kernel='rbf'), which is not scalable (try it on a small subset of the training set) and not incremental / out-of-core;
- ExtraTreesClassifier(n_estimators=100) or more, which is not out-of-core either; the larger the number of sub-estimators, the longer it will take to train.

You can also try the Nystroem approximation of SVC(kernel='rbf') by transforming the dataset using Nystroem(n_components=1000, gamma=0.05) fitted on a small subset of the data (e.g. 10000 samples) and then passing the whole transformed training set to a linear model such as SGDClassifier: it requires 2 passes over the dataset.
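The two-pass Nystroem recipe could look like the sketch below. Again this uses load_digits as a stand-in for MNIST, and n_components=300, the 500-sample subset, and gamma=0.05 are scaled-down assumptions rather than tuned values.

```python
from sklearn.datasets import load_digits
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data / 16.0, digits.target, random_state=0)

# pass 1: fit the approximate RBF feature map on a small subset
feature_map = Nystroem(n_components=300, gamma=0.05, random_state=0)
feature_map.fit(X_train[:500])

# pass 2: train a linear model on the transformed training set
clf = SGDClassifier(random_state=0)
clf.fit(feature_map.transform(X_train), y_train)
print("%.3f" % clf.score(feature_map.transform(X_test), y_test))
```

Since only the linear model sees the full dataset, the second pass can still be done out-of-core with partial_fit on transformed mini-batches.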
There is also a pull request for a 1-hidden-layer perceptron on GitHub that should be both faster to compute than ExtraTreesClassifier and approach 98% test set accuracy on MNIST (and also provide a partial_fit API for out-of-core learning).
Edit: the fluctuation of the SGDClassifier score estimate is expected: SGD stands for stochastic gradient descent, which means that examples are considered one at a time. Badly classified samples can cause an update of the model weights that is detrimental for other samples, so you need to do more than one pass over the data to let the learning rate decrease enough to get a smoother estimate of the validation accuracy. You can use itertools.repeat in your for loop to do several passes (e.g. 10) over your dataset.
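The itertools.repeat idiom keeps the original single-loop structure while doing several passes: repeating the same (X, y) pair n times yields one pass per repetition. A minimal sketch, again on load_digits as a stand-in for MNIST, with arbitrary batch size and pass count:

```python
import itertools
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data / 16.0, digits.target, random_state=0)

clf = SGDClassifier(random_state=0)
# itertools.repeat yields the same (X, y) pair 10 times: 10 passes over the data
for X_pass, y_pass in itertools.repeat((X_train, y_train), 10):
    for start in range(0, len(X_pass), 200):
        clf.partial_fit(X_pass[start:start + 200],
                        y_pass[start:start + 200],
                        classes=np.arange(10))
print("%.3f" % clf.score(X_test, y_test))
```

Note that repeat reuses the same ordering on every pass; combining it with a reshuffle per pass, as in the update above, is usually better for SGD.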