Tags: python, logistic-regression, statsmodels

Statsmodels' Logit.fit_regularized keeps running forever


Lately I've been trying to fit a regularized logistic regression on vectorized text data. I first tried sklearn and had no problem, but then I discovered that I can't do inference through sklearn, so I tried to switch to statsmodels. The problem is, when I try to fit the logit it keeps running forever and uses about 95% of my RAM (I tried on both 8GB and 16GB RAM machines).

My first guess was that it had to do with dimensionality, because I was working with a 2960 x 43k matrix. To reduce it, I deleted bigrams and took a sample of only 100 observations, which leaves me with a 100 x 6984 matrix that, I think, shouldn't be too problematic.
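The sampling step looked roughly like this (a sketch, not my exact code; the random_state is illustrative):

df_modelo = df_modelo.sample(n=100, random_state=0).reset_index(drop=True)  # keep 100 rows; reset the index so the positional split indices below line up

Bigrams are dropped later through ngram_range=(1, 1) in the vectorizer.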

This is a little sample of my code:

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.feature_extraction.text import CountVectorizer

# sss is a StratifiedShuffleSplit instance created earlier
for train_index, test_index in sss.split(df_modelo.Cuerpo, df_modelo.Dummy_genero):
    X_train, X_test = df_modelo.Cuerpo[train_index], df_modelo.Cuerpo[test_index]
    y_train, y_test = df_modelo.Dummy_genero[train_index], df_modelo.Dummy_genero[test_index]

# unigrams only; drop very rare and very common terms
cvectorizer = CountVectorizer(max_df=0.97, min_df=3, ngram_range=(1, 1))
vec = cvectorizer.fit(X_train)
X_train_vectorized = vec.transform(X_train)

This gets me a train and a test set, and then vectorizes text from X_train. Then I try:

import statsmodels.api as sm

# dense design matrix: n_obs x n_features
logit = sm.Logit(y_train.values, X_train_vectorized.todense())
result = logit.fit_regularized(method='l1')

Everything works fine until the result line, which keeps running forever. Is there something I can do? Should I switch to R if I'm looking for statistical inference?

Thanks in advance!


Solution

  • Almost all of statsmodels, and all of its inference, is designed for the case where the number of observations is much larger than the number of features.

    Logit.fit_regularized uses an interior point algorithm with scipy optimizers, which needs to keep all features in memory. Inference for the parameters requires the covariance of the parameter estimates, which has shape n_features by n_features. The use case it was designed for is when the number of features is relatively small compared to the number of observations and the Hessian can be held in memory; a back-of-the-envelope memory estimate is sketched below.

    GLM.fit_regularized estimates elastic net penalized parameters using coordinate descent. It can potentially handle a large number of features, but it does not have any inferential results available; a usage sketch follows below.

    Inference after Lasso and similar variable-selecting penalization is a topic of recent research. See for example selective inference in Python, https://github.com/selective-inference/Python-software, for which an R package is also available.
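    To see why the full 43k-feature problem is out of reach for Logit.fit_regularized, here is a rough memory estimate for a single Hessian-sized array (a sketch; the solver holds several arrays of comparable size, plus the dense design matrix):

    bytes_per_float = 8  # float64

    # the Hessian / parameter covariance is n_features x n_features
    full_gb = 43_000 ** 2 * bytes_per_float / 1e9
    print(f"43k features: {full_gb:.1f} GB")    # ~14.8 GB

    small_gb = 6_984 ** 2 * bytes_per_float / 1e9
    print(f"6984 features: {small_gb:.2f} GB")  # ~0.39 GB

    Even the reduced 100 x 6984 problem allocates hundreds of megabytes per such array, while the full matrix does not fit in 8GB or 16GB of RAM at all.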
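    If you can do without inference, the GLM elastic net route mentioned above would look roughly like this (a sketch; alpha and L1_wt are illustrative values, not tuned recommendations):

    import statsmodels.api as sm

    # Binomial family gives logistic regression; GLM expects a dense array here
    model = sm.GLM(y_train.values, X_train_vectorized.toarray(),
                   family=sm.families.Binomial())

    # elastic net via coordinate descent; L1_wt=1.0 is a pure lasso penalty
    result = model.fit_regularized(method='elastic_net', alpha=0.1, L1_wt=1.0)
    print(result.params)

    As noted above, result exposes the penalized coefficients but no standard errors or p-values.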