Tags: scikit-learn, linear-regression

LinearRegression with large number of features and targets, but small number of samples runs out of memory


I have a huge number of features and targets, but only a small number of samples. It turns out that in this case a simple linear regression via numpy.linalg.pinv works much better than sklearn.linear_model.LinearRegression.

For example, n_features=n_targets=30000, and n_samples=2:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.random.random((2, 30000))
Y = np.random.random((2, 30000))
x = np.random.random((1, 30000))

y = Y.T @ (np.linalg.pinv(X.T) @ x.T)  # fast and uses little memory; shape (30000, 1)

reg = LinearRegression(fit_intercept=False).fit(X, Y)  # slow and uses a lot of memory
y2 = reg.predict(x)  # shape (1, 30000)

np.linalg.norm(y2 - y.T) < 1e-12  # True: the two solutions agree

The problem is that reg.coef_ has shape (30000, 30000), so LinearRegression doesn't exploit the low rank of the regression.

Is there any way to get an analog of the simple pinv solution within the sklearn framework?

I tried to search, but failed to find anything useful.
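
For illustration, one way to keep the pinv trick while staying compatible with sklearn conventions is a minimal custom estimator. This is a sketch, not an existing sklearn class; the name PinvRegression is made up here:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin

class PinvRegression(BaseEstimator, RegressorMixin):
    """Minimum-norm least squares via the pseudoinverse (hypothetical sketch).

    Stores the (n_features, n_samples) pseudoinverse and the training
    targets instead of materializing the (n_targets, n_features)
    coefficient matrix.
    """

    def fit(self, X, y):
        self.pinv_ = np.linalg.pinv(X)  # (n_features, n_samples)
        self.y_train_ = y               # (n_samples, n_targets)
        return self

    def predict(self, X):
        # Equivalent to X @ coef_.T, but the huge coefficient
        # matrix is never formed.
        return (X @ self.pinv_) @ self.y_train_

X = np.random.random((2, 30000))
Y = np.random.random((2, 30000))
x = np.random.random((1, 30000))

reg = PinvRegression().fit(X, Y)
y_hat = reg.predict(x)  # shape (1, 30000)
```

Because fit keeps only an (n_features, n_samples) array, memory scales with the number of samples rather than with n_targets * n_features.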


Solution

  • I solved the problem using the sklearn.kernel_ridge.KernelRidge class with the parameters alpha=1e-15, kernel='rbf'. For my task it matches the pinv-based solution in speed, memory requirements, and accuracy. The code is compact too:

    from sklearn.kernel_ridge import KernelRidge

    reg = KernelRidge(alpha=1e-15, kernel='rbf')
    reg.fit(X, Y)
    y3 = reg.predict(x)
    

    I know that it is not a linear model, but it works just as well. Perhaps there is some kernel that behaves exactly like linear regression; I think the polynomial kernel with degree=1 should, but I haven't tried it.
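
    One cheap way to test that guess (my sketch, not part of the original answer): KernelRidge also accepts kernel='linear', and with a tiny alpha its dual-form solution coincides with the minimum-norm pinv solution, while only ever solving an (n_samples, n_samples) system:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.random((2, 30000))
Y = rng.random((2, 30000))
x = rng.random((1, 30000))

# Reference: minimum-norm least-squares prediction via the pseudoinverse.
y_ref = Y.T @ (np.linalg.pinv(X.T) @ x.T)  # shape (30000, 1)

# Linear kernel: prediction is x @ X.T @ (X @ X.T + alpha*I)^-1 @ Y,
# which tends to x @ pinv(X) @ Y as alpha -> 0.
reg = KernelRidge(alpha=1e-15, kernel='linear')
reg.fit(X, Y)
y_lin = reg.predict(x)  # shape (1, 30000)

print(np.linalg.norm(y_lin - y_ref.T))  # should be tiny
```

    The 2x2 system X @ X.T is well conditioned here, so alpha=1e-15 effectively recovers the unregularized solution.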