Tags: python, logistic-regression

Memory error when using logistic regression


I have an array with the shape 57159x924 which I will use as training data. 896 of these 924 columns are features and the remaining ones are labels. I want to use logistic regression on this, but when I use the fit function from logistic regression I get a memory error. I guess it's because it's too much data for my computer's memory to handle. Is there any way to get around this problem?

The code I want to use is

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state=1)
lr.fit(train_set, train_label)
lr.predict_proba(x_test)

And the following is the error

line 21, in main
    lr.fit(train_set, train_label)

....

    return array(a, dtype, copy=False, order=order)
MemoryError


Solution

  • You haven't given enough details to really understand the problem or give a definite answer, but here are a few options I hope will help:

    1. The amount of memory available might be configurable.
    2. Training over all the data at the same time raises OOM problems in many contexts, which is why the common practice is to use SGD (stochastic gradient descent): train over batches, i.e. introduce only a subset of the data at each iteration, and get a global optimization solution in a stochastic sense. If I'm guessing correctly, you're using sklearn.linear_model.LogisticRegression, which has different "solvers". Maybe the saga solver will handle your situation better (see the first sketch after this list).
    3. There are other implementations out there, and some of them definitely have configurable batching options built in. And if worst comes to worst, implementing a logistic-regression model yourself is fairly simple, and then batching is easy as pie (see the second sketch after this list).
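
    For point 2, a minimal sketch of what switching to the saga solver could look like. The random arrays are just stand-ins for your train_set and train_label, and note that this alone may not be enough: LogisticRegression still keeps the whole array in memory, saga only changes how it is optimized.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    
    # stand-in data; replace with your real train_set / train_label
    train_set = np.random.rand(1000, 896)
    train_label = np.random.randint(0, 2, size=1000)
    
    lr = LogisticRegression(solver='saga', random_state=1, max_iter=1000)
    lr.fit(train_set, train_label)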
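
    And for point 3, a very rough sketch of a hand-rolled logistic regression trained with mini-batch gradient descent in plain NumPy (binary 0/1 labels, no regularization, fixed learning rate; batch_size, epochs and learning_rate are arbitrary choices here):

    import numpy as np
    
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))
    
    def fit_logreg_minibatch(X, y, batch_size=256, epochs=10, learning_rate=0.1):
        # mini-batch gradient descent on the logistic (log) loss
        n, d = X.shape
        w = np.zeros(d)
        b = 0.0
        for epoch in range(epochs):
            order = np.random.permutation(n)           # shuffle once per epoch
            for start in range(0, n, batch_size):
                idx = order[start:start + batch_size]  # only this slice is needed at a time
                p = sigmoid(X[idx] @ w + b)
                grad_w = X[idx].T @ (p - y[idx]) / len(idx)
                grad_b = np.mean(p - y[idx])
                w -= learning_rate * grad_w
                b -= learning_rate * grad_b
        return w, b
    
    # toy usage
    X_toy = np.random.rand(5000, 10)
    y_toy = (X_toy[:, 0] + X_toy[:, 1] > 1.0).astype(float)
    w, b = fit_logreg_minibatch(X_toy, y_toy)
    print("train accuracy:", ((sigmoid(X_toy @ w + b) > 0.5) == y_toy).mean())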

    Edit (due to discussion in comments):
    Here's a practical way to go about it, with a very very simple (and easy) example -

    from sklearn.linear_model import SGDClassifier
    import numpy as np
    import random
    
    # two synthetic 2-D Gaussian blobs, one per class (diagonal covariance for simplicity)
    X1 = np.random.multivariate_normal(mean=[10, 5], cov=np.diag([3, 8]), size=1000)
    Y1 = np.zeros((1000, 1))
    
    X2 = np.random.multivariate_normal(mean=[-4, 55], cov=np.diag([5, 1]), size=1000)
    Y2 = np.ones((1000, 1))
    
    X = np.vstack([X1, X2])
    Y = np.vstack([Y1, Y2]).reshape([2000, ])
    
    # log loss gives logistic regression (use loss='log' on scikit-learn versions older than 1.1);
    # note that shuffle defaults to True
    sgd = SGDClassifier(loss='log_loss', warm_start=True)
    first_batch = random.sample(range(2000), 20)
    sgd.partial_fit(X[first_batch, :], Y[first_batch], classes=[0, 1])  # the first time you need to say what your classes are
    
    for k in range(1000):
        batch_indexes = random.sample(range(2000), 20)
        sgd.partial_fit(X[batch_indexes, :], Y[batch_indexes])
    

    In practice you should be monitoring the loss and accuracy and using a suitable while loop instead of the for loop, but that much is left to the reader ;-)
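
    For example, continuing the snippet above, a rough stopping rule based on the log loss of a held-out chunk could look something like this (sklearn.metrics.log_loss computes the loss; the patience and tolerance values here are arbitrary):

    from sklearn.metrics import log_loss
    
    # crude held-out set, just for illustration (in practice keep it out of the training batches)
    val_idx = random.sample(range(2000), 200)
    X_val, Y_val = X[val_idx, :], Y[val_idx]
    
    best_loss, bad_rounds, patience, k = np.inf, 0, 20, 0
    while bad_rounds < patience and k < 10000:
        batch_indexes = random.sample(range(2000), 20)
        sgd.partial_fit(X[batch_indexes, :], Y[batch_indexes])
        val_loss = log_loss(Y_val, sgd.predict_proba(X_val), labels=[0, 1])
        if val_loss < best_loss - 1e-4:
            best_loss, bad_rounds = val_loss, 0  # still improving, keep going
        else:
            bad_rounds += 1                      # no meaningful improvement this round
        k += 1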

    Note that you can control more than I've shown (like the number of iterations etc.), so you should read the documentation of SGDClassifier properly.
    Another thing to note is that there are different practices of batching. I just took a random subset every iteration, but some prefer to make sure every point in the data is seen an equal number of times (e.g. shuffle the data and then take consecutive batches in order, reshuffling after each pass). A rough sketch of that variant follows.
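
    For instance, that epoch-style variant (reusing X, Y and sgd from the example above) could look roughly like this:

    n_samples, batch_size = 2000, 20
    for epoch in range(50):
        order = np.random.permutation(n_samples)   # reshuffle at the start of every pass
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]  # consecutive slice of the shuffled order
            sgd.partial_fit(X[idx, :], Y[idx])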