python pandas for-loop linear-regression

Create a for loop over a multiple regression using pandas (statsmodels)


I am running a multiple regression for each of the 50 states to model life expectancy per state from several variables. Currently my dataset is filtered to Maine only. Is there a way to write a for loop that goes through the whole State column and performs a regression for each state? That would be more efficient than creating 50 filters. Any help would be great!

import pandas as pd
# pd.read_excel uses openpyxl as the .xlsx engine; it must be installed but need not be imported
import statsmodels.formula.api as smf

df = pd.read_excel("C:/Users/File1.xlsx", sheet_name='States')

dfME = df[df['State'] == "Maine"]

pd.set_option('display.max_columns', None)

dfME.head()

# The formula must be a string; Q("...") quotes a column name that contains a space
model = smf.ols('Q("Life Expectancy") ~ Race + Age + Weight + C(Pets)', data=dfME)
modelfit = model.fit()
modelfit.summary()

Solution

  • Assuming the rest of your code is correct, here is a strategy for the loop and for storing the model outputs:
    pd.set_option('display.max_columns', None)
    state_modelfit_summary = {}
    states = df['State'].unique()  # each state appears once, so the loop fits one model per state
    # The formula must be a string; Q("...") quotes the column name that contains a space
    formula = 'Q("Life Expectancy") ~ Race + Age + Weight + C(Pets)'
    for st in states:
        df_state = df[df['State'] == st]  # rows for the current state only
        model = smf.ols(formula, data=df_state)
        modelfit = model.fit()
        # Store the summary in a dictionary with the state name as key
        state_modelfit_summary[st] = modelfit.summary()
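
    As a variation on the loop above, here is a minimal sketch that uses groupby instead of filtering by hand, and keeps the fitted results objects so the coefficients can be compared across states. The data is synthetic and made up purely for illustration (three states, no Race column, since its type is not shown in the question); the names state_results and coef_table are my own and not from the original post.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the spreadsheet: column names mirror the question,
# but every value is randomly generated for illustration only.
rng = np.random.default_rng(0)
n_per_state = 40
frames = []
for state in ["Maine", "Ohio", "Texas"]:
    frames.append(pd.DataFrame({
        "State": state,
        "Age": rng.uniform(20, 80, n_per_state),
        "Weight": rng.uniform(50, 110, n_per_state),
        "Pets": rng.choice(["cat", "dog", "none"], n_per_state),
    }))
df = pd.concat(frames, ignore_index=True)
df["Life Expectancy"] = (
    85 - 0.05 * df["Age"] - 0.03 * df["Weight"] + rng.normal(0, 1, len(df))
)

# Q("...") quotes the column name that contains a space.
formula = 'Q("Life Expectancy") ~ Age + Weight + C(Pets)'

# groupby yields (state name, sub-DataFrame) pairs, so no manual filter is needed.
state_results = {}
for state, df_state in df.groupby("State"):
    state_results[state] = smf.ols(formula, data=df_state).fit()

# One row per state, one column per coefficient, plus R-squared.
coef_table = pd.DataFrame(
    {state: res.params for state, res in state_results.items()}
).T
coef_table["r_squared"] = pd.Series(
    {state: res.rsquared for state, res in state_results.items()}
)
print(coef_table.round(3))
```

    Keeping the results object rather than only the summary means you can still call state_results["Maine"].summary() later, while also pulling out params, pvalues, or rsquared for all states at once.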