Tags: python, numpy, scikit-learn, linear-regression

Efficient online linear regression algorithm in Python


I have a 2-D dataset with two columns, x and y. I would like to update the linear regression coefficient and intercept dynamically as new data comes in. Using scikit-learn, I can fit a model on all currently available data like this:

import numpy as np
from sklearn.linear_model import LinearRegression

regr = LinearRegression()
x = np.arange(100)
y = np.arange(100) + 10*np.random.random_sample((100,))
regr.fit(x.reshape(-1, 1), y)  # scikit-learn expects a 2-D feature matrix
print(regr.coef_)
print(regr.intercept_)

However, I have a fairly big dataset (more than 10k rows in total), and I want to recalculate the coefficient and intercept as fast as possible whenever new rows come in. Currently, fitting 10k rows takes about 600 microseconds, and I want to speed this process up.

Scikit-learn does not seem to have an online update function for its linear regression module. Is there a better way to do this?


Solution

  • I found a solution in this paper: updating simple linear regression. The implementation is below:

    import numpy as np

    def lr(x_avg, y_avg, Sxy, Sx, n, new_x, new_y):
        """
        x_avg: mean of the previous x values; if there are no previous samples, set to 0
        y_avg: mean of the previous y values; if there are no previous samples, set to 0
        Sxy: sum of cross-deviations of the previous x and y, sum((x - x_avg)*(y - y_avg)); 0 if no previous samples
        Sx: sum of squared deviations of the previous x, sum((x - x_avg)**2); 0 if no previous samples
        n: number of previous samples
        new_x: new incoming 1-D numpy array x
        new_y: new incoming 1-D numpy array y
        """
        new_n = n + len(new_x)

        new_x_avg = (x_avg*n + np.sum(new_x))/new_n
        new_y_avg = (y_avg*n + np.sum(new_y))/new_n

        if n > 0:
            # pivot point that lets Sx and Sxy be updated in one pass over the new batch
            x_star = (x_avg*np.sqrt(n) + new_x_avg*np.sqrt(new_n))/(np.sqrt(n)+np.sqrt(new_n))
            y_star = (y_avg*np.sqrt(n) + new_y_avg*np.sqrt(new_n))/(np.sqrt(n)+np.sqrt(new_n))
        elif n == 0:
            x_star = new_x_avg
            y_star = new_y_avg
        else:
            raise ValueError("n must be non-negative")

        new_Sx = Sx + np.sum((new_x - x_star)**2)
        new_Sxy = Sxy + np.sum((new_x - x_star).reshape(-1) * (new_y - y_star).reshape(-1))

        beta = new_Sxy/new_Sx                 # slope
        alpha = new_y_avg - beta*new_x_avg    # intercept
        return new_Sxy, new_Sx, new_n, alpha, beta, new_x_avg, new_y_avg
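
    Why this works: the means merge as a weighted average, and the pivot point (x*, y*) is chosen so that the running sums only need a single pass over the new batch. In my notation (not the paper's), with n old samples and m new ones:

    \bar{x}_{\text{new}} = \frac{n\bar{x} + \sum_{i=1}^{m} x_i}{n+m},
    \qquad
    x^{*} = \frac{\bar{x}\sqrt{n} + \bar{x}_{\text{new}}\sqrt{n+m}}{\sqrt{n} + \sqrt{n+m}}

    S_{xy}^{\text{new}} = S_{xy} + \sum_{i=1}^{m}(x_i - x^{*})(y_i - y^{*}),
    \qquad
    \beta = \frac{S_{xy}^{\text{new}}}{S_{x}^{\text{new}}},
    \qquad
    \alpha = \bar{y}_{\text{new}} - \beta\,\bar{x}_{\text{new}}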
    

    Performance comparison:

    Scikit-learn version, fitting all 10k samples at once:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    x = np.arange(10000).reshape(-1, 1)
    y = np.arange(10000) + 100*np.random.random_sample((10000,))
    regr = LinearRegression()
    %timeit regr.fit(x, y)
    # 419 µs ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    My version, assuming the first 9k samples have already been processed:

    Sxy, Sx, n, alpha, beta, new_x_avg, new_y_avg = lr(0, 0, 0, 0, 0, x[:9000], y[:9000])
    new_x, new_y = x[9000:], y[9000:]
    %timeit lr(new_x_avg, new_y_avg, Sxy, Sx, n, new_x, new_y)
    # 38.7 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    About 10 times faster, which is expected: the update only touches the 1k new samples instead of refitting all 10k.
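
    For completeness, a minimal streaming sketch (my own usage example, not from the paper) that feeds the data in chunks and checks the result against a full scikit-learn fit:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    x = np.arange(10000)
    y = np.arange(10000) + 100*np.random.random_sample((10000,))

    # feed the data in chunks of 1000, carrying the running statistics forward
    Sxy = Sx = n = x_avg = y_avg = 0
    for i in range(0, len(x), 1000):
        Sxy, Sx, n, alpha, beta, x_avg, y_avg = lr(
            x_avg, y_avg, Sxy, Sx, n, x[i:i+1000], y[i:i+1000])

    regr = LinearRegression().fit(x.reshape(-1, 1), y)
    print(np.allclose([beta, alpha], [regr.coef_[0], regr.intercept_]))
    # should print True: the online statistics are mathematically exact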