Tags: scikit-learn, data-science, logistic-regression, statsmodels

Statsmodels skips a value in a logistic regression?


I built a multinomial regression with scikit-learn, and it worked fine. I then tried the same data with statsmodels, since it provides more insight, but it seems to skip the first y value. Any ideas on what I may have done wrong?

I have 6 variables in in_X and 7 possible outcomes in in_y (from y=1 to y=7), but statsmodels returns only 6 columns of coefficients.

When I run print(result.summary()), the output starts at y=2.

Here's the data shape:

in_y.value_counts()
>>>
3    295
4    154
5    125
2     86
6     28
1      5
7      3
Name: y, dtype: int64

in_X.head()
>>>
     ENTERPRISE_VALUE_  SALES_GROWTH_  EBIT_TO_INT_EXP_  NET_DEBT_TO_EBITDA_  RETURN_COM_EQY_  CASH_RATIO_
918                4.0            4.0               4.0                  4.0              5.0          4.0
344                6.0            3.0               4.0                  4.0              4.0          6.0
348                5.0            3.0               3.0                  5.0              3.0          6.0
906                4.0            5.0               4.0                  4.0              4.0          4.0
80                 3.0            4.0               4.0                  4.0              4.0          4.0

(696, 6)

The code:

import pandas as pd
import statsmodels.discrete.discrete_model as sm

# Fit a multinomial logit: in_y holds the 7 outcome classes, in_X the 6 predictors
logit_model = sm.MNLogit(in_y, in_X)
result = logit_model.fit()

# Results analysis
print(result.summary())
out1 = result.params

out1

                             0          1          2          3          4          5
ENTERPRISE_VALUE_    -0.228684  -1.274831  -2.546053  -3.440249  -3.602911  -3.822631
SALES_GROWTH_         0.553498   0.706551   1.399920   1.675287   1.646694   1.152329
EBIT_TO_INT_EXP_     -0.036777  -0.304586  -0.895444  -1.351096  -1.614823  -0.593286
NET_DEBT_TO_EBITDA_   0.772482   1.690700   2.106280   2.881484   3.524116   4.281756
RETURN_COM_EQY_      -0.053659   0.269994   0.487565   0.653377   0.228949  -1.413008
CASH_RATIO_          -0.035479   0.399930   0.808460   0.722607   0.263178  -0.502091

Result summary:

Logit Regression Results                          
==============================================================================
Dep. Variable:                      y   No. Observations:                  696
Model:                        MNLogit   Df Residuals:                      660
Method:                           MLE   Df Model:                           30
Date:                Mon, 01 Oct 2018   Pseudo R-squ.:                  0.2390
Time:                        12:09:15   Log-Likelihood:                -769.38
converged:                       True   LL-Null:                       -1011.0
                                        LLR p-value:                 3.400e-83
=======================================================================================
                y=2       coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------
[...]

Solution

  • We need to drop one of the categories as the reference category because the probabilities have to add up to 1. Given the other parameters, the probability of the reference category is just one minus the sum of the non-reference probabilities. Here y=1 is the reference, which is why the summary starts at y=2 and only 6 columns of coefficients are returned.

    This is the same as in the Logit model, where we can estimate only one set of parameters, e.g. for the probability of success; the probability of the second binary outcome, e.g. the probability of failure, is just one minus the probability of success.

    In both cases the prediction for the response variable is a binary or multinomial probability distribution that must satisfy the restrictions on probabilities, i.e., values between zero and one that add up to one.
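
To make the reference-category argument concrete, here is a minimal sketch that recovers all 7 probabilities from the 6 coefficient columns printed in the question. It assumes that the columns of out1 correspond to y=2 through y=7 and that y=1 is the reference category with all coefficients fixed at zero, which is consistent with the summary starting at y=2. The function name mnlogit_probabilities is hypothetical, and the input row is the first row of in_X.head().

```python
import math

# Coefficients copied from the `out1` table in the question:
# one row per predictor, one column per non-reference outcome (assumed y=2..y=7).
params = [
    [-0.228684, -1.274831, -2.546053, -3.440249, -3.602911, -3.822631],  # ENTERPRISE_VALUE_
    [0.553498, 0.706551, 1.399920, 1.675287, 1.646694, 1.152329],        # SALES_GROWTH_
    [-0.036777, -0.304586, -0.895444, -1.351096, -1.614823, -0.593286],  # EBIT_TO_INT_EXP_
    [0.772482, 1.690700, 2.106280, 2.881484, 3.524116, 4.281756],        # NET_DEBT_TO_EBITDA_
    [-0.053659, 0.269994, 0.487565, 0.653377, 0.228949, -1.413008],      # RETURN_COM_EQY_
    [-0.035479, 0.399930, 0.808460, 0.722607, 0.263178, -0.502091],      # CASH_RATIO_
]


def mnlogit_probabilities(x, params):
    """Return probabilities for all J outcomes from a K x (J-1) coefficient table."""
    n_non_reference = len(params[0])
    # Linear predictor x'b_j for each non-reference outcome.
    scores = [
        sum(x_k * row[j] for x_k, row in zip(x, params))
        for j in range(n_non_reference)
    ]
    # Assumption: the reference outcome has all coefficients equal to zero,
    # so its linear predictor is 0.
    scores = [0.0] + scores
    # Softmax, shifted by the maximum score for numerical stability.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


x_row = [4.0, 4.0, 4.0, 4.0, 5.0, 4.0]  # first row of in_X.head() (index 918)
probs = mnlogit_probabilities(x_row, params)

for label, p in zip(range(1, 8), probs):
    print(f"P(y={label}) = {p:.4f}")

# The reference probability is one minus the sum of the other six.
print("P(y=1) from the others:", 1.0 - sum(probs[1:]))
```

Six columns of coefficients are therefore enough to describe seven outcomes. With the fitted model, result.predict(in_X) should likewise return one probability column per outcome, including y=1.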