I have an array of numbers from 1 to 20 (X_train) and an array of binary labels, 0 or 1 (y_train). I pass them to scikit-learn's LogisticRegression and train the model, but predicting on the X_test below gives incorrect results.
I created the sample train and test data as shown below. Please suggest what's wrong with the code.
import numpy as np
X_train = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], dtype=float).reshape(-1, 1)
y_train = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0], dtype=float)
X_test = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 55, 88, 99, 100], dtype=float).reshape(-1, 1)
from sklearn import linear_model
logreg = linear_model.LogisticRegression()
logreg.fit(X_train, y_train)
y_predict = logreg.predict(X_test)
print(y_predict)
Output :
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0.]
A similar question was already asked here. I would like to use this post as inspiration for my solution.
But first let me mention two things:
A logistic regression is very beneficial in terms of time, performance and explainability if there is some kind of linear relationship between your feature(s) and the label, but that is clearly not the case in your example. You want to estimate a discontinuous function that equals one if the input is odd and zero otherwise, which is not easily achieved with a single numeric feature.
Your data representation is not a good fit. I think this point is the more critical one for your prediction goal, as a better data representation leads directly to a better prediction.
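To illustrate the first point: with a single numeric feature, the model's score is w·x + b, which is monotone in x, so the predicted probability can only rise or fall along the number line and the decision boundary is a single threshold — it can never alternate with parity. A minimal sketch on the question's own data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.arange(1, 21, dtype=float).reshape(-1, 1)
y = (X.ravel() % 2).astype(float)  # 1 for odd, 0 for even

model = LogisticRegression().fit(X, y)
proba = model.predict_proba(X)[:, 1]

# The predicted probability sigmoid(w*x + b) is monotone in x,
# so the decision boundary is at most a single threshold.
print(np.all(np.diff(proba) >= 0) or np.all(np.diff(proba) <= 0))  # prints True
```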
Next, I would like to share an alternative data representation. This new representation yields perfect predictions, even with a simple untuned logistic regression.
Code:
import numpy as np
from sklearn import linear_model
X_train = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
y_train = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0], dtype=float)
X_test = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 55, 88, 99, 100])
def convert_repr(x):
    # Represent x by its 16 binary digits, e.g. 5 -> [0, ..., 0, 1, 0, 1]
    return [int(b) for b in format(x, '016b')]
# Change data representation
X_train = np.array(list(map(convert_repr, X_train)))
X_test = np.array(list(map(convert_repr, X_test)))
logreg = linear_model.LogisticRegression()
logreg.fit(X_train, y_train)
y_predict = logreg.predict(X_test)
print(y_predict)
Output:
[1. 0. 1. 0. 1. 0. 1. 0. 1. 0. 1. 0. 1. 0.]
As you can see, the data is more important than the actual model.
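If you are curious why this representation works so well, you can inspect the learned coefficients: parity is fully determined by the least-significant bit, so the last feature column should receive by far the largest weight. A small sketch re-creating the training setup above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def convert_repr(x):
    # Represent x by its 16 binary digits
    return [int(b) for b in format(x, '016b')]

X_train = np.array([convert_repr(x) for x in range(1, 21)])
y_train = np.array([x % 2 for x in range(1, 21)], dtype=float)

logreg = LogisticRegression().fit(X_train, y_train)

# The last column is the least-significant bit, which equals the label
# exactly, so its coefficient should dominate all others in magnitude.
coef = logreg.coef_[0]
print(np.round(coef, 2))
```

Bit positions that are always zero for x ≤ 20 keep a zero weight, while the least-significant bit gets a large positive one — the model essentially learns "predict the last bit".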