
Adjusted R square for each predictor variable in python


I have a pandas data frame that contains several columns. I need to perform a multivariate linear regression. Before doing that, I would like to analyze the R, R2, adjusted R2 and p-value of each independent variable with respect to the dependent variable. For R and R2 I have no problem: I can calculate the correlation matrix, select the row for the dependent variable, and read off the R coefficient between it and each independent variable. I can then square these values to obtain R2. My problem is how to do the same for the adjusted R2 and the p-value. In the end, what I want to obtain is something like this:

 Variable     R        R2       ADJUSTED_R2   p_value
 A            0.4193   0.1758   ...
 B            0.2620   0.0686   ...
 C            0.2535   0.0643   ...

All the values are with respect to the dependent variable, let's say Y.
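A minimal sketch of one way to build that table, assuming hypothetical column names `Y`, `A`, `B`, `C` and random data: `scipy.stats.pearsonr` gives both the Pearson R and its p-value, and for a single-predictor regression the adjusted R2 follows from the formula 1 − (1 − R2)(n − 1)/(n − k − 1) with k = 1.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: Y is the dependent variable, A/B/C the predictors
np.random.seed(0)
df = pd.DataFrame(np.random.rand(30, 4), columns=['Y', 'A', 'B', 'C'])

n = len(df)
k = 1  # one predictor at a time
rows = []
for col in ['A', 'B', 'C']:
    r, p = stats.pearsonr(df[col], df['Y'])   # Pearson R and its p-value
    r2 = r ** 2
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)  # adjusted R2 for k = 1
    rows.append((col, r, r2, adj_r2, p))

table = pd.DataFrame(rows, columns=['Variable', 'R', 'R2', 'ADJUSTED_R2', 'p_value'])
print(table)
```

The same numbers can also come out of statsmodels, as the solution below shows.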


Solution

  • The following will not give you ALL the answers, but it WILL get you going with Python, pandas and statsmodels for regression analysis.


    Given a dataframe like this...

    # Imports
    import pandas as pd
    import numpy as np
    import statsmodels.api as sm   # needed for sm.OLS below
    import itertools               # for iterating over variable combinations later
    
    # A dataframe with random numbers
    np.random.seed(123)
    rows = 12
    listVars = ['y', 'x1', 'x2', 'x3']
    rng = pd.date_range('1/1/2017', periods=rows, freq='D')
    df_1 = pd.DataFrame(np.random.randint(100, 150, size=(rows, len(listVars))), columns=listVars)
    df_1 = df_1.set_index(rng)
    
    print(df_1)
    


    ...you can get any of the regression results using the statsmodels library by altering the result = model.rsquared part in the snippet below:

    x = df_1['x1']
    x = sm.add_constant(x)                # add the intercept term
    model = sm.OLS(df_1['y'], x).fit()
    result = model.rsquared
    print(result)
    


    Now you have r-squared. Use model.pvalues for the p-values, and use dir(model) to take a closer look at the other results the fitted model exposes (there is more in there than shown here):


    Now, this should get you going toward your desired results. To get them for ALL combinations of variables / columns, the question and answer here should get you very far.

    Edit: You can take a closer look at some common regression results using model.summary(). Using that together with dir(model), you can see that not ALL regression results are available the same way the p-values are via model.pvalues. To get Durbin-Watson, for example, you'll have to use durbinwatson = sm.stats.stattools.durbin_watson(model.fittedvalues, axis=0). This post has more information on the issue.