python, logistic-regression, statsmodels

How to predict using logit from statsmodels.formula.api


I have the following problem. I would like to do an in-sample prediction using logit from statsmodels.formula.api.

See my code:

import statsmodels.formula.api as smf

model_logit = smf.logit(formula="dep ~ var1 + var2 + var3", data=model_data)

So far everything works. Now I would like to make in-sample predictions with the model:

yhat5 = model_logit.predict(params=["dep", "var1", "var2", "var3"])

This raises ValueError: data type must provide an itemsize.

When I try:

yhat5 = model_logit.predict(params="dep ~ var1 + var2 + var3")

I get a different error: numpy.core._exceptions._UFuncNoLoopError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U32'), dtype('<U69')) -> None

How can I make in-sample predictions for the Logit model from statsmodels.formula.api?

This did not help me: How to predict new values using statsmodels.formula.api (python)


Solution

  • Using an example dataset:

    import statsmodels.formula.api as smf
    import pandas as pd
    import numpy as np
    from sklearn.datasets import make_classification
    
    # Three features: two informative, one redundant
    X, y = make_classification(n_features=3, n_informative=2, n_redundant=1)
    model_data = pd.DataFrame(X, columns=['var1', 'var2', 'var3'])
    model_data['dep'] = y
    

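    Fit the model first; that step is missing from your code. The params argument of predict() on the unfitted model expects numeric coefficients, not column names or a formula string, which explains both errors. Calling fit() returns a results object that stores the coefficients: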

    import statsmodels.formula.api as smf
    model_logit = smf.logit(formula="dep ~ var1 + var2 + var3", data=model_data)
    res = model_logit.fit()
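
To make the difference between the two predict() methods concrete, here is a minimal sketch. It assumes synthetic data generated with NumPy (made up for illustration, not the asker's data) and compares the unfitted model's predict(), given the fitted coefficients, against the results object's predict().

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data, invented for illustration only.
rng = np.random.default_rng(0)
n = 200
model_data = pd.DataFrame(rng.normal(size=(n, 3)), columns=["var1", "var2", "var3"])
linear_part = 0.5 + 1.0 * model_data["var1"] - 1.0 * model_data["var2"]
prob = 1.0 / (1.0 + np.exp(-linear_part))
model_data["dep"] = (rng.uniform(size=n) < prob).astype(int)

model_logit = smf.logit(formula="dep ~ var1 + var2 + var3", data=model_data)
res = model_logit.fit(disp=0)  # disp=0 silences the optimizer output

# The model's predict() needs the numeric coefficients as `params`.
from_model = np.asarray(model_logit.predict(res.params))
# The results object already holds the coefficients, so no arguments are needed.
from_results = np.asarray(res.predict())

same = np.allclose(from_model, from_results)
print(same)
```

Both calls should give the same in-sample probabilities; res.predict() is simply the more convenient form.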
    

    You can then get the in-sample predictions (as probabilities) and the predicted labels, here using a 0.5 threshold:

    in_sample = pd.DataFrame({'prob': res.predict()})
    in_sample['pred_label'] = (in_sample['prob'] > 0.5).astype(int)
    
    in_sample.head()
     
           prob  pred_label
    0  0.005401           0
    1  0.911056           1
    2  0.990406           1
    3  0.412332           0
    4  0.983642           1
    

    Then check the predicted labels against the actual ones:

    pd.crosstab(in_sample['pred_label'], model_data['dep'])
     
    dep          0   1
    pred_label        
    0           46   2
    1            4  48
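
The same results object can also score rows the model has not seen. The sketch below is an illustration under assumptions: the training data is synthetic and the values in new_data are hypothetical. With the formula API, passing a DataFrame that contains the predictor columns to predict() should be enough, since the formula is applied to the new data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic training data, invented for illustration only.
rng = np.random.default_rng(1)
n = 200
model_data = pd.DataFrame(rng.normal(size=(n, 3)), columns=["var1", "var2", "var3"])
linear_part = 0.5 + 1.0 * model_data["var1"] - 1.0 * model_data["var2"]
prob = 1.0 / (1.0 + np.exp(-linear_part))
model_data["dep"] = (rng.uniform(size=n) < prob).astype(int)

res = smf.logit("dep ~ var1 + var2 + var3", data=model_data).fit(disp=0)

# Hypothetical new observations; only the predictor columns are required.
new_data = pd.DataFrame({
    "var1": [0.0, 1.5],
    "var2": [0.0, -1.0],
    "var3": [0.0, 0.3],
})

new_prob = res.predict(new_data)            # predicted probabilities
new_label = (new_prob > 0.5).astype(int)    # same 0.5 threshold as above
print(pd.DataFrame({"prob": new_prob, "pred_label": new_label}))
```

The 0.5 cutoff is a convention rather than a requirement; a different threshold may suit data with unbalanced classes.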