python machine-learning scikit-learn logistic-regression kaggle

How to use logistic regression on test data


I am using logistic regression on the Titanic dataset, and pandas raises an error asking me to pass a DataFrame with boolean values only:

Traceback (most recent call last):
  File "C:/Users/security/Downloads/AP/Titanic-Kaggle/TItanic-Kaggle.py", line 29, in <module>
    predictions = logReg.predict(test[test_data])
  File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\pandas\core\frame.py", line 2914, in __getitem__
    return self._getitem_frame(key)
  File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\pandas\core\frame.py", line 3009, in _getitem_frame
    raise ValueError('Must pass DataFrame with boolean values only')
ValueError: Must pass DataFrame with boolean values only

I don't understand why, because the exact same features were used to train the logistic regression model and were accepted without error. Here is my code (ignore the repetition; that's a problem I'm going to tackle afterwards):

import pandas as pd
import warnings
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

warnings.filterwarnings("ignore", category=FutureWarning)

train = pd.read_csv("https://raw.githubusercontent.com/oo92/Titanic-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Titanic-Kaggle/master/test.csv")

train['Sex'] = train['Sex'].replace(['female', 'male'], [0, 1])
train['Embarked'] = train['Embarked'].replace(['C', 'Q', 'S'], [1, 2, 3])
train['Age'].fillna(train.groupby('Sex')['Age'].transform("median"), inplace=True)
train['HasCabin'] = train['Cabin'].notnull().astype(int)
train['Relatives'] = train['SibSp'] + train['Parch']
train_data = train[['Pclass', 'Sex', 'Relatives', 'Fare', 'Age', 'Embarked', 'HasCabin']]
x_train, x_validate, y_train, y_validate = train_test_split(train_data, train['Survived'], test_size=0.22, random_state=0)

test['Sex'] = test['Sex'].replace(['female', 'male'], [0, 1])
test['Embarked'] = test['Embarked'].replace(['C', 'Q', 'S'], [1, 2, 3])
test['Age'].fillna(test.groupby('Sex')['Age'].transform("median"), inplace=True)
test['HasCabin'] = test['Cabin'].notnull().astype(int)
test['Relatives'] = test['SibSp'] + test['Parch']
test_data = test[['Pclass', 'Sex', 'Relatives', 'Fare', 'Age', 'Embarked', 'HasCabin']]

logReg = LogisticRegression()
logReg.fit(x_train, y_train)

predictions = logReg.predict(test[test_data])
submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': predictions})

filename = 'Titanic-Submission.csv'
submission.to_csv(filename, index=False)

Specifically, Python takes issue with this snippet:

test_data = test[['Pclass', 'Sex', 'Relatives', 'Fare', 'Age', 'Embarked', 'HasCabin']]

...

predictions = logReg.predict(test[test_data])

UPDATE

I've changed my predictions variable to this:

predictions = logReg.predict(test_data)

And now this is my stacktrace:

Traceback (most recent call last):
  File "C:/Users/security/Downloads/AP/Titanic-Kaggle/TItanic-Kaggle.py", line 29, in <module>
    predictions = logReg.predict(test_data)
  File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\sklearn\linear_model\base.py", line 281, in predict
    scores = self.decision_function(X)
  File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\sklearn\linear_model\base.py", line 257, in decision_function
    X = check_array(X, accept_sparse='csr')
  File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\sklearn\utils\validation.py", line 573, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\sklearn\utils\validation.py", line 56, in _assert_all_finite
    raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

This suggests that my feature selection/engineering for the test data is not working as intended.


Solution

  • Predictions with x_validate work without a problem. Try:

    >>> predictions = logReg.predict(x_validate)
    

    So there must be something wrong with test_data. Get some information on the dataframes and compare:

    >>> x_validate.info(verbose=True)
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 197 entries, 495 to 45
    Data columns (total 7 columns):
    Pclass       197 non-null int64
    Sex          197 non-null int64
    Relatives    197 non-null int64
    Fare         197 non-null float64
    Age          197 non-null float64
    Embarked     197 non-null int64
    HasCabin     197 non-null int64
    dtypes: float64(2), int64(5)
    memory usage: 12.3 KB
    
    >>> test_data.info(verbose=True)
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 418 entries, 0 to 417
    Data columns (total 7 columns):
    Pclass       418 non-null int64
    Sex          418 non-null int64
    Relatives    418 non-null int64
    Fare         417 non-null float64
    Age          418 non-null float64
    Embarked     418 non-null int64
    HasCabin     418 non-null int64
    dtypes: float64(2), int64(5)
    memory usage: 22.9 KB
    

    Fare has only 417 non-null values out of 418 entries, so test_data contains one NaN, which is what predict rejects:

    Fare         417 non-null float64
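
  • One way to confirm and handle the missing value is sketched below. This is a minimal illustration, not the only fix: the small DataFrame is a made-up stand-in for test_data (only the column names come from the question), and filling with the median is an assumption; any imputation you consider sensible would do.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for test_data: a few of the same column names,
# with one missing Fare to reproduce the situation.
test_data = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Fare": [7.25, 71.28, np.nan, 13.0],
    "Age": [22.0, 38.0, 60.5, 27.0],
})

# Count missing values per column to confirm where the NaN is.
missing_before = test_data.isna().sum()
print(missing_before)

# Fill the gap with the column median (one simple choice among several).
filled = test_data.copy()
filled["Fare"] = filled["Fare"].fillna(filled["Fare"].median())

# Every column should now report zero missing values.
missing_after = filled.isna().sum()
print(missing_after)
```

    In the script from the question, the same idea would mean calling fillna on test['Fare'] before test_data is built, so that the frame passed to logReg.predict no longer contains a NaN.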