I have an array with the shape 57159x924 which I will use as training data. 896 of these 924 columns are features and the remaining are labels. I want to use logistic regression on this, but when I use the fit function from logistic regression I get a memory error. I guess it is because it's too much data for my computer's memory to handle. Is there any way to get around this problem?
The code I want to use is
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state=1)
lr.fit(train_set, train_label)
lr.predict_proba(x_test)
And the following is the error:
line 21, in main lr.fit(train_set, train_label)
....
return array(a, dtype, copy=False, order=order) MemoryError
You haven't given enough details to really understand the problem or give a definite answer, but here are a couple of options I hope will help.

The saga solver is designed for large datasets and will likely handle your situation better.
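For instance, a minimal sketch based on your own snippet (solver='saga' is the relevant change; max_iter=1000 is just an illustrative value, and train_set / train_label / x_test are your existing arrays):

from sklearn.linear_model import LogisticRegression

# 'saga' is a stochastic solver meant for large datasets
lr = LogisticRegression(solver='saga', random_state=1, max_iter=1000)
lr.fit(train_set, train_label)
lr.predict_proba(x_test)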
Edit (due to discussion in comments):
Here's a practical way to go about it, with a very simple example -
from sklearn.linear_model import SGDClassifier
import numpy as np
import random
X1 = np.random.multivariate_normal(mean=[10, 5], cov = np.diag([3, 8]), size=1000) # diagonal covariance for simplicity
Y1 = np.zeros((1000, 1))
X2 = np.random.multivariate_normal(mean=[-4, 55], cov = np.diag([5, 1]), size=1000) # diagonal covariance for simplicity
Y2 = np.ones((1000, 1))
X = np.vstack([X1, X2])
Y = np.vstack([Y1, Y2]).reshape([2000,])
sgd = SGDClassifier(loss='log', warm_start=True) # as mentioned in answer. note that shuffle is defaulted to True.
# (on scikit-learn >= 1.3 the logistic loss is spelled loss='log_loss' instead)
sgd.partial_fit(X, Y, classes=[0, 1]) # on the first call you need to say what your classes are
for k in range(1000):
    batch_indexs = random.sample(range(2000), 20)         # a random mini-batch of 20 points
    sgd.partial_fit(X[batch_indexs, :], Y[batch_indexs])  # incremental update on that mini-batch
In practice you should be looking at the loss and accuracy and using a suitable while loop instead of the for loop, but that much is left for the reader ;-)
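If you want a concrete starting point, here is a rough sketch of what such a stopping loop could look like. This is only illustrative: it continues the snippet above (reusing X, Y, sgd, random and np), checks sklearn.metrics.log_loss on the training data, and the 1e-4 tolerance is an arbitrary choice.

from sklearn.metrics import log_loss

prev_loss = np.inf
while True:
    batch_indexs = random.sample(range(2000), 20)
    sgd.partial_fit(X[batch_indexs, :], Y[batch_indexs])
    # loss on the full training data; for a really huge dataset you might
    # evaluate on a held-out subset instead to keep memory down
    cur_loss = log_loss(Y, sgd.predict_proba(X))
    if prev_loss - cur_loss < 1e-4:  # stopped improving -> break out
        break
    prev_loss = cur_loss

In practice you would probably also want some patience (several stalled checks) before stopping, since a single noisy mini-batch can make the loss fluctuate.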
Note that you can control more than I've shown here (like the number of iterations, etc.), so you should read the SGDClassifier documentation carefully.
Another thing to note is that there are different batching practices. I just took a random subset at every iteration, but some prefer to make sure every point in the data is seen an equal number of times (e.g. shuffle the data and then take batches in index order, epoch by epoch).
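For example, an epoch-style version of the loop above could look roughly like this (again just a sketch, reusing X, Y and sgd from the example; the 10 epochs and batch size of 20 are arbitrary):

n_samples = X.shape[0]
batch_size = 20
for epoch in range(10):
    order = np.random.permutation(n_samples)   # new shuffle on every pass over the data
    for start in range(0, n_samples, batch_size):
        idx = order[start:start + batch_size]
        sgd.partial_fit(X[idx, :], Y[idx])      # each point is seen exactly once per epoch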