Tags: python, pandas, sklearn-pandas

Pandas/sklearn: Vectorize large number of LinearRegression calculations


I have a Pandas DataFrame for which I need to calculate a large number of regression coefficients. Each calculation is a simple linear regression with a single predictor. The independent variable is the column ['Base'], which is the same in every case. The dependent variables are stored as separate columns of the DataFrame.

This is easy to accomplish with a for loop, but my real-life DataFrame has thousands of columns to regress, so the loop takes forever. Is there a vectorized way to accomplish this?

Below is an MRE:

import pandas as pd
import numpy as np
from sklearn import linear_model

df_data = {
        'Base':np.random.randint(1, 100, 1000),
        'Adder':np.random.randint(-3, 3, 1000)}

df = pd.DataFrame(data=df_data)
result_df = pd.DataFrame()

df['Thing1'] = df['Base'] * 3 + df['Adder']
df['Thing2'] = df['Base'] * 6 + df['Adder']
df['Thing3'] = df['Base'] * 12 + df['Adder']
df['Thing4'] = df['Base'] * 4 + df['Adder']
df['Thing5'] = df['Base'] * 2.67 + df['Adder']

things = ['Thing1', 'Thing2', 'Thing3', 'Thing4', 'Thing5']

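# Fit one simple regression per target column (the slow loop to be replaced)
for t in things:
    reg = linear_model.LinearRegression()
    X, y = df['Base'].values.reshape(-1, 1), df[t].values.reshape(-1, 1)
    reg.fit(X, y)
    b = reg.coef_[0][0]  # slope of column t regressed on Base
    result_df.loc[t, 'Beta'] = b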

print(result_df.to_string())


Solution

  • You can use np.polyfit for linear regression. It accepts a 2-D y and fits every column against the same x in a single call:

    pd.DataFrame(np.polyfit(df['Base'], df.filter(like='Thing'), deg=1)).T
    

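    Output (column 0 is the slope, column 1 is the intercept; exact values vary between runs because the data is random):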

               0         1
    0   3.002379 -0.714256
    1   6.002379 -0.714256
    2  12.002379 -0.714256
    3   4.002379 -0.714256
    4   2.672379 -0.714256
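
The one-liner above returns unlabeled columns, so a slightly longer sketch may be easier to read. It rebuilds the question's data (the fixed seed and the column names 'Beta' and 'Intercept' are assumptions made for illustration), labels the polyfit result, and cross-checks the slopes against the closed-form formula cov(x, y) / var(x) computed for all columns at once:

```python
import numpy as np
import pandas as pd

# Synthetic data mirroring the question's MRE; the seed is an assumption
# added here so that repeated runs give the same numbers.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Base": rng.integers(1, 100, 1000),
    "Adder": rng.integers(-3, 3, 1000),
})
factors = {"Thing1": 3, "Thing2": 6, "Thing3": 12, "Thing4": 4, "Thing5": 2.67}
for name, factor in factors.items():
    df[name] = df["Base"] * factor + df["Adder"]

targets = df.filter(like="Thing")
x = df["Base"].to_numpy(dtype=float)
Y = targets.to_numpy(dtype=float)

# np.polyfit fits every column of a 2-D y in one call.
# For deg=1 the result has shape (2, n_columns):
# row 0 holds the slopes, row 1 the intercepts.
slope, intercept = np.polyfit(x, Y, deg=1)
result_df = pd.DataFrame(
    {"Beta": slope, "Intercept": intercept},
    index=targets.columns,
)

# Cross-check: closed-form slope cov(x, y) / var(x), vectorized over columns.
xc = x - x.mean()
closed_form = xc @ (Y - Y.mean(axis=0)) / (xc @ xc)

print(result_df)
print("matches closed form:", np.allclose(slope, closed_form))
```

Because every column shares the same x, the work reduces to one least-squares solve with many right-hand sides, so this should scale to thousands of columns far better than refitting in a loop. If staying within scikit-learn is preferred, LinearRegression.fit also accepts a 2-D y of shape (n_samples, n_targets), which should give the same slopes in a single fit.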