python, scikit-learn, classification, logistic-regression

Prediction based on multiple binary inputs


Assume we have the following DataFrame, where A, B, C, and D are the binary outcomes of a classification task. "1" means "finished" and "0" means "not finished".

A B C D True
0 1 1 1 1
1 0 0 0 0
1 1 1 1 1
1 1 1 1 1
0 1 1 1 1
0 0 0 0 0 
1 1 1 1 1
0 1 0 0 1
0 1 1 1 1
1 1 1 1 1
0 1 0 0 0

I wonder whether it is possible to predict the True outcome from the values in A, B, C, and D.

Should I apply a multivariate logistic regression with scikit-learn?


Solution

  • You could use sklearn's LogisticRegression:

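    from sklearn.linear_model import LogisticRegression
    
    # `data` is the DataFrame from the question; its target column is named 'True'.
    # Bracket access is needed because `data.True` is a syntax error in Python 3.
    endog = data['True'].values                 # target
    exog = data.drop('True', axis=1).values     # features A, B, C, D
    model = LogisticRegression()
    model.fit(exog, endog)
    
    model.score(exog, endog)  # mean accuracy
    # 0.90909090909090906
    
    model.predict(exog)       # your predicted values
    # array([1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1], dtype=int64)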
    

    Keep in mind that in this example you are training a statistical model and then predicting on the same (in-sample) data you have already fed the model. That is generally regarded as shabby statistical practice, because the in-sample score tends to overstate how well the model generalizes, so proceed with caution or test on out-of-sample data.
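
One way to get an out-of-sample estimate is stratified k-fold cross-validation. The sketch below assumes the question's table is typed in by hand as `data`. The choice of 3 folds and `random_state=0` is arbitrary and not part of the original answer. With only 11 rows, three of which have a 0 outcome, the per-fold scores will be very noisy, so treat them as a rough sanity check rather than a reliable estimate.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# The DataFrame from the question, entered column by column.
data = pd.DataFrame(
    {
        "A":    [0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0],
        "B":    [1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1],
        "C":    [1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0],
        "D":    [1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0],
        "True": [1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0],
    }
)

X = data.drop("True", axis=1).values  # features A, B, C, D
y = data["True"].values               # target

# Stratification keeps one of the three 0 outcomes in each test fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Each score is the accuracy on a fold the model did not see during fitting.
scores = cross_val_score(LogisticRegression(), X, y, cv=cv)
print(scores, scores.mean())
```

Each entry in `scores` comes from rows held out of the fit, so the mean is a fairer (if noisy) indication of predictive accuracy than `model.score` on the training data.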