machine-learning, classification, logistic-regression, prediction

Logistic Regression prediction faults


I have been trying to solve the Titanic survival problem. I split the data so that x is the passengers and y is the survival outcome, but the problem is that the prediction results (y_pred) are 0 for every sample. It would be helpful if anyone could look at this, as it is my first classifier problem and I am a beginner.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


df = pd.read_csv('C:/Users/Umer/train.csv')
x = df['PassengerId'].values.reshape(-1,1)
y = df['Survived']


from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25,
                                                    random_state=0)


from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(x_train,y_train)

#predicting the test set results


y_pred = classifier.predict(x_test)

Solution

  • I couldn't reproduce the same result. In fact, I copy-pasted your code and did not get all zeros as you described; instead I got:

    [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0]
    

    Nevertheless, there are a few things I noticed in your approach that you may want to know about:

    1. The default separator in Pandas read_csv is `,`, so if your dataset's values are separated by tabs (like the copy I have), you should specify the separator like this:

      df = pd.read_csv('titanic.csv', sep='\t')
      
    2. PassengerId carries no useful information that your model can learn from in order to predict who Survived; it's just a counter that increments with each new passenger. Generally speaking, in classification you want to make use of every feature the model can learn from (unless, of course, a feature is redundant and adds no information), especially with a multivariate dataset like this one; the sketch after this list illustrates the point.

    3. There is no point in scaling PassengerId, because feature scaling helps when features vary widely in magnitude, units, and range (e.g. 5 kg vs. 5000 g); in your case, as mentioned, it's just an incremental integer that carries no real information for the model, and scaling cannot change that (also shown in the sketch below).

    4. One last thing: you should cast your data to float for StandardScaler, to avoid warnings like the following:

      DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
      

      So convert it from the beginning like this:

      x = df['PassengerId'].values.astype(float).reshape(-1,1)
      

    Finally, if you're still getting the same result, please add a link to your dataset.
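
    To make points 2 and 3 concrete, here is a minimal sketch (assuming the standard Kaggle Titanic train.csv columns): PassengerId barely correlates with Survived, and since StandardScaler only applies the linear map (x - mean) / std, scaling cannot add information that the feature does not already carry.

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv('train.csv')
    # Pearson correlation with the target: PassengerId lands near zero,
    # while Fare and Pclass correlate noticeably more strongly.
    print(df[['PassengerId', 'Fare', 'Pclass']].corrwith(df['Survived']))

    # Scaling is linear, so the scaled ids stay perfectly correlated
    # with the originals and carry exactly the same information.
    ids = df['PassengerId'].values.astype(float).reshape(-1, 1)
    scaled = StandardScaler().fit_transform(ids).ravel()
    print(pd.Series(scaled).corr(df['PassengerId']))  # ~ 1.0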


    Update

    After the dataset was provided, it turns out that the result you're getting is correct; again, that's because of reason number 2 above (PassengerId provides no useful information to the model, so it cannot predict correctly).

    You can test this yourself by comparing the log loss before and after adding more features from the dataset:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import log_loss

    df = pd.read_csv('train.csv', sep=',')
    x = df['PassengerId'].values.reshape(-1, 1)
    y = df['Survived']
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25,
                                                        random_state=0)
    classifier = LogisticRegression()
    classifier.fit(x_train, y_train)
    y_pred_train = classifier.predict(x_train)
    # calculate and print the loss function using only the PassengerId
    print(log_loss(y_train, y_pred_train))
    # predicting the test set results
    y_pred = classifier.predict(x_test)
    print(y_pred)
    

    Output

    13.33982681120802
    [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
     0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
     0]
    

    Now, using some "supposedly useful" features:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import log_loss

    df = pd.read_csv('train.csv', sep=',')
    # encode the strings female and male as 0 and 1
    df['Sex'].replace(['female', 'male'], [0, 1], inplace=True)
    # try three features that look informative to the model
    # so it can learn from them
    x = df[['Fare', 'Pclass', 'Sex']].values.reshape(-1, 3)
    y = df['Survived']
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25,
                                                        random_state=0)
    classifier = LogisticRegression()
    classifier.fit(x_train, y_train)
    y_pred_train = classifier.predict(x_train)
    # calculate and print the loss function with the above 3 features
    print(log_loss(y_train, y_pred_train))
    # predicting the test set results
    y_pred = classifier.predict(x_test)
    print(y_pred)
    

    Output

    7.238735137632405
    [0 0 0 1 1 0 1 1 0 1 0 1 0 1 1 1 0 0 0 0 0 1 0 0 1 1 0 1 1 1 0 1 0 0 0 0 0
     0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 1 0 1 0 1 1 1 0 0 0
     0 1 1 0 0 0 0 0 1 0 0 1 1 1 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 1 1 1 1 0 1 0
     1 0 1 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 1 1 1 0 1
     1 0 0 1 1 0 1 0 1 0 1 1 0 0 1 1 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0
     0 1 0 0 1 1 0 1 1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 1 0 1
     1]
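
    As a side note, newer pandas releases emit a chained-assignment warning for replace(..., inplace=True) on a single column; a small equivalent sketch that avoids it:

    # encode Sex without an inplace replacement on a column view
    df['Sex'] = df['Sex'].map({'female': 0, 'male': 1})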
    

    In Conclusion:

    As you can see, the loss is now better (lower than before) and the predictions are much more reasonable!
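
    One more side note on the metric: log_loss is usually computed on predicted probabilities rather than hard 0/1 labels, so the absolute values above mainly serve the before/after comparison. A minimal sketch, reusing classifier, x_train, and y_train from the last block:

    # log loss on class probabilities, the metric's usual input
    y_proba_train = classifier.predict_proba(x_train)
    print(log_loss(y_train, y_proba_train))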