Tags: python, pandas, logistic-regression, statsmodels

Is it necessary to add a constant to a logit model run on categorical variables only?


I have a dataframe that looks like this:

I am running a logit model with fluid as the dependent variable, excluding vp and perip:

import statsmodels.formula.api as smf

# Treatment(reference=...) sets the baseline level for that categorical term
model = smf.logit('''fluid ~ C(examq3_n, Treatment(reference=2.0)) + C(pmhq3_n) + C(fluidq3_n) + C(mapq3_n, Treatment(reference=3.0)) +
                  C(examq6_n, Treatment(reference=2.0)) + C(pmhq6_n) + C(fluidq6_n) + C(mapq6_n, Treatment(reference=3.0)) +
                  C(case, Treatment(reference=2))''',
                  data=case1_2_vars).fit()
print(model.summary())

I get the following results:

I am wondering whether I need to add a constant to the data and, if so, how. I've tried adding a column called const, equal to 1, to the dataframe, but when I then add const to the logit equation I get LinAlgError: Singular Matrix. I also don't know how to add it with sm.add_constant(), because I have had to specify the categorical variables and their reference levels in the equation, rather than defining x and y separately and passing those to the model.

My questions are: a) do I need to add a constant, and b) how? There are some links online that seem to imply it might not be necessary for a categorical variable-based logit model, but I would rather do it if it's best practice.

I'm also wondering, does statsmodels automatically include a constant? Because Intercept is listed in the results.


Solution

  • If you use formulas, patsy's formula handling automatically adds a constant/intercept.
    (e.g. when using smf.logit or sm.Logit.from_formula)

  • If you create a model without a formula, using numpy arrays or a pandas DataFrame, statsmodels does not change exog, i.e. users need to add a constant themselves. The helper function is sm.add_constant, which adds a column of ones to the array or DataFrame.
    (e.g. when using sm.Logit(y, x))