Tags: python, logistic-regression, statsmodels

Computer freezing whenever I use fitted model summary


I'm doing a logistic regression for a school project. After I train the model and try to produce the fitted model summary, the line

fitted_model = linear_regression.fit()

just runs and runs and runs, to the point that the browser freezes. I've let it run for about twenty minutes. The data isn't that large in my opinion, as it's only 10,000 rows (again, I'm a student, so please correct me if I'm wrong). That said, I used get_dummies on many columns and currently have 18,000 columns. However, I saw this thread, and if I'm reading it correctly, it should still run OK (How many features can scikit-learn handle?). Any advice before I just start from scratch?
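
For reference, here is a quick way to confirm how large the training matrix actually is (this assumes X_train is the pandas DataFrame produced by get_dummies):

print(X_train.shape)  # (rows, columns), e.g. (10000, 18000)
print(f"{X_train.memory_usage(deep=True).sum() / 1e6:.0f} MB")  # rough memory footprint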

Below is more of the code, just in case it's helpful.

from sklearn.linear_model import LogisticRegression

# Fit a scikit-learn logistic regression as a baseline
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

y_hat_train = logreg.predict(X_train)
y_hat_test = logreg.predict(X_test)

from sklearn.metrics import classification_report
print(classification_report(y_test, y_hat_test))

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Add an intercept column and fit with statsmodels
Xc = sm.add_constant(X_train)
linear_regression = sm.OLS(y_train, Xc)  # note: sm.OLS fits a linear model, not a logistic one
fitted_model = linear_regression.fit()
fitted_model.summary()

Solution

  • Try reducing the number of samples you have and see whether your code still runs, and how long it takes. Putting this before your code snippet should work:

    # Keep only a small subset of the data for a quick test
    n_train = 100
    n_test = 10
    X_train = X_train[:n_train]
    y_train = y_train[:n_train]
    X_test = X_test[:n_test]
    y_test = y_test[:n_test]
    

    As a side note, it sounds like you may simply have too much data. More than 20 minutes for 10,000 rows can be reasonable depending on the number of features per row. With 1 feature per row you only have 10,000 numbers, which isn't a lot; but with 1,000 features per row you have 10,000 × 1,000 = 10,000,000 values, and with your 18,000 dummy columns even more, which is a whole other story. A minimal timing sketch follows.
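
    As a sketch of that timing check (assuming X_train and y_train are already defined, and mirroring your sm.OLS call; the subsample size is just illustrative):

    import time
    import statsmodels.api as sm

    n_train = 100  # small subsample for a quick timing test
    Xc_small = sm.add_constant(X_train[:n_train])

    start = time.time()
    fitted_small = sm.OLS(y_train[:n_train], Xc_small).fit()
    print(f"fit on {n_train} rows took {time.time() - start:.1f} s")

    # If even this is slow, the 18,000 columns are the likely culprit:
    # OLS has to factorize a matrix with one column per feature, and
    # summary() then renders a table row for every coefficient.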