I have a huge number of features and targets but a small number of samples. It turns out that in this case a simple linear regression based on numpy.linalg.pinv works much better than sklearn.linear_model.LinearRegression. For example, with n_features=n_targets=30000 and n_samples=2:
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.random.random((2, 30000))
Y = np.random.random((2, 30000))
x = np.random.random((1, 30000))

y = (Y.T @ (np.linalg.pinv(X.T) @ x.T)).T  # fast and uses little memory
reg = LinearRegression(fit_intercept=False).fit(X, Y)  # slow and uses a lot of memory
y2 = reg.predict(x)
np.linalg.norm(y2 - y) < 1e-12  # True: the predictions agree
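A quick note on why the two agree (my reasoning, not a claim from the sklearn docs): with fit_intercept=False and more features than samples, LinearRegression returns the minimum-norm least-squares solution W = pinv(X) @ Y, and pinv satisfies pinv(X.T) == pinv(X).T, so Y.T @ (pinv(X.T) @ x.T) is just the transpose of x @ W. The identity is easy to check at a small size:

A = np.random.random((2, 5))
assert np.allclose(np.linalg.pinv(A.T), np.linalg.pinv(A).T)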
The problem is that reg.coef_.shape is (30000, 30000), so LinearRegression does not exploit the low rank of the regression: the coefficient matrix has rank at most n_samples = 2, yet it is computed and stored densely.
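The low rank is easy to see concretely. A small sketch (sizes shrunk here so the dense coefficient matrix fits in memory):

Xs = np.random.random((2, 3000))
Ys = np.random.random((2, 3000))
W = np.linalg.pinv(Xs) @ Ys        # dense (3000, 3000) coefficient matrix
print(np.linalg.matrix_rank(W))    # 2: the rank is bounded by n_samples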
Is there any way to get an analog of the simple pinv solution within the sklearn framework? I tried to search but failed to find anything useful.
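For illustration, one such analog would be a tiny custom estimator that keeps the coefficients in factored form (a minimal sketch of my own; PinvRegression is hypothetical, not an existing sklearn class):

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin

class PinvRegression(BaseEstimator, RegressorMixin):
    """Minimum-norm least squares with coefficients kept in factored form."""

    def fit(self, X, Y):
        # pinv(X) is (n_features, n_samples) and Y is (n_samples, n_targets),
        # so memory is O(n_samples * (n_features + n_targets)) rather than
        # a dense (n_features, n_targets) coefficient matrix.
        self.pinv_X_ = np.linalg.pinv(X)
        self.Y_ = Y
        return self

    def predict(self, X):
        # Evaluates X @ pinv(X_train) @ Y_train left to right, so no large
        # intermediate matrix is ever formed.
        return (X @ self.pinv_X_) @ self.Y_

With the data above, PinvRegression().fit(X, Y).predict(x) should reproduce the pinv-based y.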
I solved the problem using the sklearn.kernel_ridge.KernelRidge class with the parameters alpha=1e-15, kernel='rbf'. For my task it shows the same speed, memory requirements, and accuracy as the pinv-based solution. The code is compact too:
from sklearn.kernel_ridge import KernelRidge

reg = KernelRidge(alpha=1e-15, kernel='rbf')
reg.fit(X, Y)
y3 = reg.predict(x)
I know that it is not a linear model, but it works just as well. Perhaps there is some kernel that behaves exactly like linear regression. I think the polynomial kernel with degree=1 should work that way, but I haven't tried it (see the sketch below).
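As a sketch of that idea (untested, so treat it as an assumption rather than a verified result): KernelRidge also accepts kernel='linear', which keeps the model linear in the inputs, and as alpha goes to zero its dual solution x @ X.T @ inv(X @ X.T + alpha*I) @ Y approaches the pinv prediction.

from sklearn.kernel_ridge import KernelRidge

# kernel='linear' computes X @ X.T; a degree-1 polynomial kernel with
# coef0=0 is the same up to a gamma scaling factor.
reg_lin = KernelRidge(alpha=1e-15, kernel='linear')
reg_lin.fit(X, Y)
y4 = reg_lin.predict(x)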