Tags: python, pandas, linear-regression, statsmodels

How to put linear regression results (variable name, p_value) into a dataframe using a for loop?


I have one target variable and hundreds of predictor variables. I am trying to run a linear regression on one predictor variable at a time and then create a dataframe that stores all the univariate regression results (namely the variable name and p_value) using a for loop.

Here is my regression code in Python (X_data has all the predictor variables and y_data has the target variable):

import statsmodels.api as sm
for column in X_data:
    exog = sm.add_constant(X_data[column],prepend = False)
    mod = sm.OLS(y_data, exog)
    res = mod.fit()
    print(column, ' ', res.pvalues[column])

The printed results look like:

variable1 0.003
variable2 0.3

...

How can I create a pandas dataframe to save all the p_value results?


Solution

  • You can initialize an empty container, say a dict, before the loop, populate it inside the loop, and construct the DataFrame afterwards.

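    import pandas as pd
    import statsmodels.api as sm
    
    d = {}  # maps each predictor name to its p-value
    for column in X_data:
        exog = sm.add_constant(X_data[column], prepend=False)
        mod = sm.OLS(y_data, exog)
        res = mod.fit()
        d[column] = res.pvalues[column]
    
    # One row per predictor, with a single 'pval' column
    df = pd.DataFrame.from_dict(d, orient='index', columns=['pval'])
    #            pval
    #variable1  0.003
    #variable2  0.300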
    

    If you need to store multiple pieces of information (coefficients, confidence intervals, standard errors...) then your dict can store a dict of attributes for each key.

    d = {}
    for column in X_data:
        ...
        d[column] = {'pval': res.pvalues[column], 'other_feature': ...}
    
    print(d)
    #{'variable1': {'pval': 0.003, 'other_feature': 'XX'}, 
    # 'variable2': {'pval': 0.300, 'other_feature': 'YY'}}
    
    df = pd.DataFrame.from_dict(d, orient='index')
    #            pval  other_feature
    #variable1  0.003             XX
    #variable2  0.300             YY