Search code examples
pythonstatisticsstatsmodels

Statsmodels Logistic Regression class imbalance


I'd like to run a logistic regression on a dataset with 0.5% positive class by re-balancing the dataset through class or sample weights. I can do this in scikit learn, but it doesn't provide any of the inferential stats for the model (confidence intervals, p-values, residual analysis).

Is this possible to do in statsmodels? I don't see a sample_weights or class_weights argument in statsmodels.discrete.discrete_model.Logit.fit

Thank you!


Solution

  • programmer's answer:

    statsmodels Logit and other discrete models don't have weights yet. (*)

    GLM Binomial has implicitly defined case weights through the number of successful and unsuccessful trials per observation. It would also allow manipulating the weights through the GLM variance function, but that is not officially supported and tested yet.

    update statsmodels Logit still does not have weights, but GLM has obtained var_weights and freq_weights several statsmodels releases ago. GLM Binomial can be used to estimate a Logit or a Probit model.

    statistician's/econometrician's answer:

    Inference, standard errors, confidence intervals, tests and so on, are based on having a random sample. If weights are manipulated, then this should affect the inferential statistics. However, I never looked at the problem for rebalancing the data based on the observed response. In general, this creates a selection bias. A quick internet search shows several answers, from rebalancing doesn't have a positive effect in Logit to penalized estimation as alternative.

    One possibility is to also try different link function, cloglog or other link functions have asymmetric or heavier tails that are more appropriate for data with small risk in one class or category.

    (*) One problem with implementing weights is to decide what their interpretation is for inference. Stata, for example, allows for 3 kinds of weights.