Using a variable instead of a column name in the statsmodels formula API


I have a variable cols that contains a list of column names for my table.

Now I want to run a regression on my table by looping through the columns listed in cols.

I am trying to use the statsmodels formula API (Patsy) but am unable to construct a proper formula.

The code that I am trying right now is:

model = smf.ols(formula="Annual_Sales ~ Q('cols')", data=df).fit()

But this obviously does not work, because cols is not a column in my df table.

Any suggestions on how I can do this, preferably with a for loop? I have 150 columns and can't manually enter all those names in the formula.

Thank You


Solution

  • One way I was able to solve this problem was with string formatting, since the formula passed to statsmodels is just a string.

    So if we have,

    col = ["a", "b", "c", "d"]
    

    We can write,

    for i in range(len(col) - 1):
        for j in range(i + 1, len(col)):
            # Wrap both column names in Q() so Patsy treats each one as a quoted column name.
            formula = "Annual_Sales ~ Q('{}') + Q('{}')".format(col[i], col[j])
            model = smf.ols(formula=formula, data=df).fit()
    

    This lets us loop through the list col, taking two columns at a time to build each model. Note that model is overwritten on every iteration, so inspect or store each fit inside the loop if you need it later.
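For the original question (one regression per column in cols), the same string-formatting idea can be sketched roughly as below. This is only a minimal sketch under some assumptions: the DataFrame is synthetic and made up for illustration, the dictionary names single_models and pair_models are hypothetical, and itertools.combinations is swapped in for the nested index loops. It also assumes no column name contains a single quote, since that would break the Q('...') quoting.

```python
from itertools import combinations

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data, invented purely for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b col", "c"])
df["Annual_Sales"] = 2.0 * df["a"] - df["b col"] + rng.normal(size=50)

cols = ["a", "b col", "c"]

# One model per column: build the formula string, fit, and keep the result.
single_models = {}
for name in cols:
    formula = "Annual_Sales ~ Q('{}')".format(name)
    single_models[name] = smf.ols(formula=formula, data=df).fit()

# One model per unordered pair of columns.
pair_models = {}
for first, second in combinations(cols, 2):
    formula = "Annual_Sales ~ Q('{}') + Q('{}')".format(first, second)
    pair_models[(first, second)] = smf.ols(formula=formula, data=df).fit()

# Print R-squared for each single-column fit.
for name, fitted in single_models.items():
    print(name, round(fitted.rsquared, 3))
```

Keeping the fitted results in a dictionary keyed by column name means nothing is lost between iterations, and Q() lets names with spaces (such as "b col" here) be used in the formula.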