I have a pandas data frame that contains several columns, and I need to perform a multivariate linear regression. Before doing that, I would like to analyze the R, R2, adjusted R2 and p-value of each independent variable with respect to the dependent variable. For R and R2 I have no problem: I can calculate the correlation matrix, select only the dependent variable, and read off the R coefficient between it and each independent variable. Then I can square these values to obtain R2. My problem is how to do the same for the adjusted R2 and the p-value. In the end, what I want to obtain is something like this:
Variable R R2 ADJUSTED_R2 p_value
A 0.4193 0.1758 ...
B 0.2620 0.0686 ...
C 0.2535 0.0643 ...
All the values are with respect to the dependent variable let's say Y.
The following will not give you ALL the answers, but it WILL get you going using python, pandas and statsmodels for regression analyses.
Given a dataframe like this...
# Imports
import pandas as pd
import numpy as np
import itertools
# A dataframe with random numbers
np.random.seed(123)
rows = 12
listVars= ['y','x1', 'x2', 'x3']
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_1 = pd.DataFrame(np.random.randint(100,150,size=(rows, len(listVars))), columns=listVars)
df_1 = df_1.set_index(rng)
print(df_1)
...you can get any regression result using the statsmodels library by changing the result = model.rsquared part in the snippet below:
import statsmodels.api as sm

# Regress y on x1, with an intercept (constant) term
x = df_1['x1']
x = sm.add_constant(x)
model = sm.OLS(df_1['y'], x).fit()
result = model.rsquared
print(result)
Now you have r-squared. Use model.pvalues for the p-values and model.rsquared_adj for the adjusted r-squared. Use dir(model) to have a closer look at the other model results (there is more in the output than what you can see below):
Now, this should get you going to obtain your desired results. To get desired results for ALL combinations of variables / columns, the question and answer here should get you very far.
Edit: You can have a closer look at some common regression results using model.summary(). Using that together with dir(model), you can see that not ALL regression results are available the same way that the p-values are through model.pvalues. To get Durbin-Watson, for example, you'll have to use durbinwatson = sm.stats.stattools.durbin_watson(model.resid, axis=0). Note that the statistic is computed from the residuals (model.resid), not from the fitted values.
This post has got more information on the issue.