python scikit-learn classification logistic-regression

Prediction based on multiple binary inputs

Assume we have the following DataFrame, where A, B, C, and D are the binary outcomes of a classification task: "1" means "finished" and "0" means "not finished".

A B C D True
0 1 1 1 1
1 0 0 0 0
1 1 1 1 1
1 1 1 1 1
0 1 1 1 1
0 0 0 0 0 
1 1 1 1 1
0 1 0 0 1
0 1 1 1 1
1 1 1 1 1
0 1 0 0 0

I would like to know how well the True outcome can be predicted from the values in A, B, C, and D.

Should I apply a multivariate logistic regression with scikit-learn?

Solution

You could use sklearn's LogisticRegression:

from sklearn.linear_model import LogisticRegression

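# data is the DataFrame from the question; its target column is named "True"
endog = data['True'].values                # target: the "True" column
exog = data.drop('True', axis=1).values    # features: A, B, C, D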
model = LogisticRegression()
model.fit(exog, endog)

model.score(exog, endog)  # mean accuracy
# 0.90909090909090906

model.predict(exog)       # your predicted values
# array([1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1], dtype=int64)

Keep in mind that this example trains the model and then predicts on the same (in-sample) data it was fitted on, so the accuracy above is an optimistic estimate. That is generally regarded as poor statistical practice; to judge how well the model really predicts, evaluate it on out-of-sample data, for example with a held-out test set or cross-validation.
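
With only 11 rows, one way to get an out-of-sample estimate is leave-one-out cross-validation: fit on 10 rows, predict the remaining one, and repeat for every row. The snippet below is a minimal sketch of that idea, not part of the original answer. It assumes the DataFrame is rebuilt by hand from the table in the question, and the names `X`, `y`, and `scores` are chosen here for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Rebuild the DataFrame from the question (assumption: values copied by hand).
data = pd.DataFrame({
    "A":    [0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0],
    "B":    [1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1],
    "C":    [1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0],
    "D":    [1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0],
    "True": [1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0],
})

X = data.drop("True", axis=1).values  # features A, B, C, D
y = data["True"].values               # target column

# Leave-one-out: each row is predicted by a model that never saw it.
scores = cross_val_score(LogisticRegression(), X, y, cv=LeaveOneOut())

# Each entry of scores is 1.0 (held-out row predicted correctly) or 0.0.
print("out-of-sample accuracy:", scores.mean())
```

On a data set this small the estimate is still very noisy, so treat the number as a rough indication rather than a reliable measure of predictive power.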