Tags: python, machine-learning, scikit-learn, logistic-regression

Predict if a number is odd or even (y = x % 2) using Logistic Regression


Given an array of the numbers 1-20 (X_train) and an array of binary labels (y_train), I pass them to the Logistic Regression algorithm and train the model. Predicting with the X_test below gives me incorrect results.

I created the sample train and test data as shown below. Please suggest what's wrong with the code.

import numpy as np
from sklearn import linear_model

# Features: the integers 1-20; labels: 1 = odd, 0 = even
X_train = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], dtype=float).reshape(-1, 1)
y_train = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1,  0,  1,  0,  1,  0,  1,  0,  1,  0,  1,  0], dtype=float)
X_test = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 55, 88, 99, 100], dtype=float).reshape(-1, 1)

logreg = linear_model.LogisticRegression()
logreg.fit(X_train, y_train)
y_predict = logreg.predict(X_test)
print(y_predict)

Output:
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0.]

Solution

  • A similar question was already asked here. I would like to use this post as inspiration for my solution.

    But first let me mention two things:

    1. Logistic regression is very attractive in terms of training time, performance, and explainability when there is a roughly linear relationship between your feature(s) and the log-odds of the label, but that is clearly not the case here. You want to approximate a discontinuous function that equals one if the input is odd and zero otherwise, and no single linear threshold on the raw integer can do that (see the sketch after this list).

    2. Your data representation is not good. I think this point is even more critical for your prediction goal, because a better data representation directly leads to a better prediction.
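
    To make point 1 concrete: on the raw integers, the fitted model's decision function is just w*x + b, which is monotone in x, so the predicted class can flip at most once along the number line and can never alternate between consecutive integers. That is exactly the thresholded output shown in the question. A minimal sketch illustrating this on the same data:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.arange(1, 21).reshape(-1, 1)
    y = X.ravel() % 2  # 1 = odd, 0 = even

    model = LogisticRegression().fit(X, y)
    # With a single scalar weight w, the decision function w*x + b is
    # monotone in x: one threshold, never an alternating pattern.
    print(model.coef_, model.intercept_)
    print(model.decision_function(X))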

    Next, I would like to share an alternative data representation: the 16-bit binary expansion of each number. This representation yields perfect predictions, even for a simple, untuned logistic regression.

    Code:

    import numpy as np
    from sklearn import linear_model

    X_train = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
    y_train = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1,  0,  1,  0,  1,  0,  1,  0,  1,  0,  1,  0], dtype=float)
    X_test = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 55, 88, 99, 100])

    def convert_repr(x):
        # 16-bit binary expansion, e.g. 5 -> [0, ..., 0, 1, 0, 1];
        # the last element is the least-significant (parity) bit
        return [int(b) for b in format(x, '016b')]

    # Change the data representation from raw integers to bit vectors
    X_train = np.array([convert_repr(x) for x in X_train])
    X_test = np.array([convert_repr(x) for x in X_test])

    logreg = linear_model.LogisticRegression()
    logreg.fit(X_train, y_train)
    y_predict = logreg.predict(X_test)
    print(y_predict)
    

    Output:

    [1. 0. 1. 0. 1. 0. 1. 0. 1. 0. 1. 0. 1. 0.]
    

    As you can see, the representation of the data matters more than the actual model.
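
    One way to see why this representation works: x is odd exactly when its least-significant bit is 1, and convert_repr places that bit in the last column, so the classifier only has to learn one large weight on a single feature. A quick, self-contained sketch that checks this (the exact coefficient values vary with the scikit-learn version and solver):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def convert_repr(x):
        # 16-bit binary expansion; the last column is the least-significant bit
        return [int(b) for b in format(x, '016b')]

    X = np.array([convert_repr(x) for x in range(1, 21)])
    y = np.arange(1, 21) % 2  # 1 = odd, 0 = even

    model = LogisticRegression().fit(X, y)
    w = model.coef_[0]
    # The weight on the parity bit should dominate all other weights.
    print(w[-1], np.abs(w[:-1]).max())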