Tags: python, machine-learning, deep-learning, logistic-regression, gradient-descent

Logistic regression from scratch: error keeps increasing


I have implemented logistic regression from scratch, but when I run the script the algorithm always predicts the wrong label. I've tried changing the training output and test_output by switching all 1s to 0s and vice versa, but it still predicts the wrong label.
I also noticed that if I change the "-" sign to "+" when updating the weights and the bias, the script correctly predicts the label.
What am I doing wrong?
This is the code I've written:

# IMPORTS
import numpy as np

# HYPERPARAMETERS
EPOCHS = 1000
LEARNING_RATE = 0.1

# FUNCTIONS
def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def cost(y_pred, training_outputs, m):
    j = - np.sum(training_outputs * np.log(y_pred) + (1 - training_outputs) * np.log(1 - y_pred)) / m
    return j


# ENTRY
if __name__ == "__main__":
    
    # Training input and output
    x = np.array([[1, 1, 1], [0, 0, 0], [1, 0, 1]])
    training_outputs = np.array([1, 0, 1])

    # Test input and output
    test_input = np.array([[0, 1, 1]])
    test_output = np.array([0])

    # Weights
    w = np.array([0.3, 0.3, 0.3])

    # Biases
    b = 0

    m = 3

    # Training
    for iteration in range(EPOCHS):
        print("Iteration n.", iteration, end= "\r")
        
        # Compute log odds
        z = np.dot(x, w) + b

        # Compute predicted probability
        y_pred = sigmoid(z)

        # Back propagation
        dz = y_pred - training_outputs
        dw = np.dot(x, dz) / m
        db = np.sum(dz) / m

        # Update weights and bias according to the gradient descent algorithm
        w = w - LEARNING_RATE * dw
        b = b - LEARNING_RATE * db

    print("Model trained. Proceeding with model evaluation...")

    # Test
    # Compute log odds
    z = np.dot(test_input, w) + b

    # Compute predicted probability
    y_pred = sigmoid(z)
    print(y_pred)
    
    # Compute cost
    cost = cost(y_pred, test_output, m)

    print(cost)

Solution

  • There was an incorrect assumption pointed out by @J_H:

    >>> from sklearn.linear_model import LogisticRegression
    >>> import numpy as np
    >>> x = np.array([[1, 1, 1], [0, 0, 0], [1, 0, 1]])
    >>> y = np.array([1, 0, 1])
    >>> clf = LogisticRegression().fit(x, y)
    >>> clf.predict([[0, 1, 1]])
    array([1])
    

    scikit-learn appears to believe that test_output should be a 1 rather than a 0.
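
    As an extra sanity check (an addition, not part of the original answer), the fitted clf from the snippet above can also report its predicted probabilities for the test row, not just the hard label:

    # Illustrative follow-up to the REPL session above.
    # predict_proba returns one row per input: [P(class 0), P(class 1)].
    print(clf.predict_proba([[0, 1, 1]]))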

    A few more recommendations:

    • m should be fine to remove (it's a constant, so it can be folded into LEARNING_RATE)
    • w should be initialized with one weight per column of x (i.e., with length x.shape[1])
    • dw = np.dot(x, dz) should be np.dot(dz, x) (see the quick check after this list)
    • Prediction in logistic regression depends on a threshold, usually 0.5
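
    To see why the np.dot order matters (this check is an illustrative addition, using an arbitrary dz): np.dot(x, dz) dots each sample row with dz, mixing the features within a sample, whereas np.dot(dz, x) is x.T @ dz, i.e. each feature's error summed over the samples, which is the gradient of the loss with respect to w.

    import numpy as np

    x = np.array([[1, 1, 1], [0, 0, 0], [1, 0, 1]])
    dz = np.array([0.5, -0.25, 0.25])  # arbitrary stand-in for y_pred - training_outputs

    print(np.dot(x, dz))  # -> [0.5, 0.0, 0.75]: per-row sums, not the gradient
    print(np.dot(dz, x))  # -> [0.75, 0.5, 0.75]: per-feature gradient, i.e. x.T @ dz

    The mix-up only goes unnoticed here because x happens to be square; with m samples and n features (m != n), np.dot(x, dz) would raise a shape error instead.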

    Taking these points into account, the training code would look something like the following.

    # Initialize weights and bias
    w, b = np.zeros(x.shape[1]), 0
    
    for _ in range(EPOCHS):
        # Compute log odds
        z = np.dot(x, w) + b
    
        # Compute predicted probability
        y_pred = sigmoid(z)
    
        # Back propagation
        dz = y_pred - training_outputs
        dw = np.dot(dz, x)
        db = np.sum(dz)
    
        # Update
        w = w - LEARNING_RATE * dw
        b = b - LEARNING_RATE * db
    
    # Test
    z = np.dot(test_input, w) + b
    test_pred = sigmoid(z) >= 0.5
    print(test_pred)
    

    And a complete example on random train/test sets created with sklearn.datasets.make_classification could look like this; it usually lands within a few decimal places of the scikit-learn implementation's accuracy as well:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    import numpy as np
    
    EPOCHS = 100
    LEARNING_RATE = 0.01
    
    def sigmoid(z):
        return 1 / (1 + np.exp(-z))
    
    if __name__ == "__main__":
    
        X, y = make_classification(n_samples=1000, n_features=5)
        X_train, X_test, y_train, y_test = train_test_split(X, y)
    
        # Initialize `w` and `b`
        w, b = np.zeros(X.shape[1]), 0
    
        for _ in range(EPOCHS):
            z = np.dot(X_train, w) + b
            y_pred = sigmoid(z)
            dz = y_pred - y_train
            dw = np.dot(dz, X_train)
            db = np.sum(dz)
            w = w - LEARNING_RATE * dw
            b = b - LEARNING_RATE * db
    
        # Test
        z = np.dot(X_test, w) + b
        test_pred = sigmoid(z) >= 0.5
        print(accuracy_score(y_test, test_pred))
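
    For reference (an addition to the original answer), scikit-learn's own LogisticRegression can be fitted on the same split at the end of the script above, which makes the "within a few decimal places" claim easy to check on any given run:

    from sklearn.linear_model import LogisticRegression

    # Fit the reference implementation on the identical train/test split
    # and print its accuracy next to the from-scratch result above.
    clf = LogisticRegression().fit(X_train, y_train)
    print(accuracy_score(y_test, clf.predict(X_test)))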