Tags: python, pandas, loops, linear-regression

How do I fit a linear regression model for each of about 500 y columns in a file, using Python?


This code manually selects one column from the y table, joins it to the X table, and then fits a linear regression. How can I do this for every column in the y table?

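import pandas as pd
from sklearn import linear_model

# Load the y table and drop its date column
yDF = pd.read_csv('ytable.csv')
yDF.drop('Dates', axis=1, inplace=True)
XDF = pd.read_csv('Xtable.csv')

# Take the first y column and append it to the X table
ycolumnDF = yDF.iloc[:, 0].to_frame()
regressionDF = pd.concat([XDF, ycolumnDF], axis=1)

# Columns 1-19 are the predictors; column 20 onward is the appended y column
X = regressionDF.iloc[:, 1:20]
y = regressionDF.iloc[:, 20:].squeeze()

lm = linear_model.LinearRegression()
lm.fit(X, y)
cf = lm.coef_
print(cf)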

Solution

  • You can regress multiple y's on the same X's in a single fit. Something like this should work:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    
    df_X = pd.DataFrame(columns = ['x1','x2','x3'], data = np.random.normal(size = (10,3)))
    df_y = pd.DataFrame(columns = ['y1','y2'], data = np.random.normal(size = (10,2)))
    X = df_X.iloc[:,:]
    y = df_y.iloc[:,:]
    lm = LinearRegression().fit(X,y)
    print(lm.coef_)
    

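    produces output of this form (the exact numbers change from run to run because the data is random and unseeded):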

    [[ 0.16115884  0.08471495  0.39169592]
     [-0.51929011  0.29160846 -0.62106353]]
    

    The first row here ([ 0.16115884 0.08471495 0.39169592]) holds the regression coefficients of y1 on the x's, and the second row holds the regression coefficients of y2 on the x's.
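
To apply the same idea to tables shaped like the ones in the question, something along these lines might work. This is only a sketch under assumptions: synthetic data stands in for Xtable.csv and ytable.csv, the column names x1..x3 and y1..y5 are made up, both tables are assumed to carry a 'Dates' column with rows aligned by position, and there are assumed to be no missing values (LinearRegression rejects NaN). It labels the coefficient matrix with the y and x column names and checks that a single multi-output fit matches an explicit loop over the y columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for Xtable.csv and ytable.csv (assumed layout:
# a 'Dates' column followed by numeric columns, rows aligned by position).
rng = np.random.default_rng(0)
n_rows = 50
dates = pd.date_range("2020-01-01", periods=n_rows)
XDF = pd.DataFrame(rng.normal(size=(n_rows, 3)), columns=["x1", "x2", "x3"])
XDF.insert(0, "Dates", dates)
yDF = pd.DataFrame(rng.normal(size=(n_rows, 5)),
                   columns=[f"y{i}" for i in range(1, 6)])
yDF.insert(0, "Dates", dates)

# Keep only the numeric columns for the regression.
X = XDF.drop(columns="Dates")
Y = yDF.drop(columns="Dates")

# One fit handles every y column; coef_ has shape (n_targets, n_features).
lm = LinearRegression().fit(X, Y)
coefs = pd.DataFrame(lm.coef_, index=Y.columns, columns=X.columns)
coefs["intercept"] = lm.intercept_
print(coefs)

# Equivalent explicit loop: one separate model per y column.
loop_coefs = pd.DataFrame(
    {col: LinearRegression().fit(X, Y[col]).coef_ for col in Y.columns},
    index=X.columns,
).T

# Both approaches should give the same coefficients.
print(np.allclose(coefs[X.columns].to_numpy(), loop_coefs.to_numpy()))
```

Each row of coefs corresponds to one y column, so with about 500 y columns the result would have about 500 rows, and coefs.loc['y1'] (hypothetical column name) would pull out the fit for a single target.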