Tags: python, pandas, dataframe, linear-regression, scatter

How to create 1 linear regression for two groups of data


I have two scatterplots that I've placed on one plot. I want to find the linear regression line for the points of y1 and y2 combined (that is, a single regression of the pooled y1 and y2 values on x), but I'm having difficulty since I usually only find the regression line for y1 or y2 separately. I also want to find the r^2 value for the combined y1 and y2. I would appreciate any help I can get!

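import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# 15 rows of random integers in two columns, A and B
df1 = pd.DataFrame(np.random.randint(0, 100, size=(15, 2)), columns=list('AB'))

y1 = df1['A']
y2 = df1['B']

# both series share the same x values (the DataFrame index)
plt.scatter(df1.index, y1)
plt.scatter(df1.index, y2)
plt.show()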

Solution

  • It sounds like you want to 'stack' columns A and B together. There are many ways to do it; here is one using stack:

    df2 = df1.stack().rename('A_and_B').reset_index(level = 1, drop = True).to_frame()
    

    Then df2.head() looks like this:

    
        A_and_B
    0   35
    0   58
    1   49
    1   73
    2   44
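
As one of the other ways to combine the columns, here is a sketch using pd.concat. It assumes the row order of the combined column does not matter, which holds for a least-squares fit. The seeded generator and the name `combined` are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Seeded random data so the example is repeatable (an assumption for illustration)
rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.integers(0, 100, size=(15, 2)), columns=list("AB"))

# stack() version from the answer: rows interleaved as 0, 0, 1, 1, ...
stacked = df1.stack().rename("A_and_B").reset_index(level=1, drop=True).to_frame()

# concat() alternative: all of A followed by all of B, original index kept
combined = pd.concat([df1["A"], df1["B"]]).rename("A_and_B").to_frame()

print(combined.shape)  # 30 rows, 1 column
```

Both frames hold the same 30 (index, value) pairs; only the row order differs.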
    

    The scatter plot of the combined column is then:

    plt.scatter(df2.index, df2['A_and_B'])
    plt.show()


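    I don't know how you run your regressions, but you can now apply your usual method to df2. For example, with statsmodels (this particular call adds no constant, so it fits a line through the origin):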

    import statsmodels.api as sm
    res = sm.OLS(df2['A_and_B'], df2.index).fit()
    res.summary()
    

    output:

                                     OLS Regression Results                                
    =======================================================================================
    Dep. Variable:                A_and_B   R-squared (uncentered):                   0.517
    Model:                            OLS   Adj. R-squared (uncentered):              0.501
    Method:                 Least Squares   F-statistic:                              31.10
    Date:                Mon, 14 Mar 2022   Prob (F-statistic):                    5.11e-06
    Time:                        23:02:47   Log-Likelihood:                         -152.15
    No. Observations:                  30   AIC:                                      306.3
    Df Residuals:                      29   BIC:                                      307.7
    Df Model:                           1                                                  
    Covariance Type:            nonrobust                                                  
    ==============================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    x1             4.8576      0.871      5.577      0.000       3.076       6.639
    ==============================================================================
    Omnibus:                        3.466   Durbin-Watson:                   1.244
    Prob(Omnibus):                  0.177   Jarque-Bera (JB):                1.962
    Skew:                          -0.371   Prob(JB):                        0.375
    Kurtosis:                       1.990   Cond. No.                         1.00
    ==============================================================================
    
    Notes:
    [1] R² is computed without centering (uncentered) since the model does not contain a constant.
    [2] Standard Errors assume that the covariance matrix of the errors is correctly specified.