I am using Logistic Regression on my Titanic model, and pandas raises an error asking me to pass a DataFrame with boolean values only:
Traceback (most recent call last):
File "C:/Users/security/Downloads/AP/Titanic-Kaggle/TItanic-Kaggle.py", line 29, in <module>
predictions = logReg.predict(test[test_data])
File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\pandas\core\frame.py", line 2914, in __getitem__
return self._getitem_frame(key)
File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\pandas\core\frame.py", line 3009, in _getitem_frame
raise ValueError('Must pass DataFrame with boolean values only')
ValueError: Must pass DataFrame with boolean values only
I don't understand why, because the exact same features were passed to Logistic Regression when training the model and were accepted then. Here is my code (ignore the code repetition; that's a problem I'm going to tackle afterwards):
import pandas as pd
import warnings
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
warnings.filterwarnings("ignore", category=FutureWarning)
train = pd.read_csv("https://raw.githubusercontent.com/oo92/Titanic-Kaggle/master/train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/oo92/Titanic-Kaggle/master/test.csv")
train['Sex'] = train['Sex'].replace(['female', 'male'], [0, 1])
train['Embarked'] = train['Embarked'].replace(['C', 'Q', 'S'], [1, 2, 3])
train['Age'].fillna(train.groupby('Sex')['Age'].transform("median"), inplace=True)
train['HasCabin'] = train['Cabin'].notnull().astype(int)
train['Relatives'] = train['SibSp'] + train['Parch']
train_data = train[['Pclass', 'Sex', 'Relatives', 'Fare', 'Age', 'Embarked', 'HasCabin']]
x_train, x_validate, y_train, y_validate = train_test_split(train_data, train['Survived'], test_size=0.22, random_state=0)
test['Sex'] = test['Sex'].replace(['female', 'male'], [0, 1])
test['Embarked'] = test['Embarked'].replace(['C', 'Q', 'S'], [1, 2, 3])
test['Age'].fillna(test.groupby('Sex')['Age'].transform("median"), inplace=True)
test['HasCabin'] = test['Cabin'].notnull().astype(int)
test['Relatives'] = test['SibSp'] + test['Parch']
test_data = test[['Pclass', 'Sex', 'Relatives', 'Fare', 'Age', 'Embarked', 'HasCabin']]
logReg = LogisticRegression()
logReg.fit(x_train, y_train)
predictions = logReg.predict(test[test_data])
submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': predictions})
filename = 'Titanic-Submission.csv'
submission.to_csv(filename, index=False)
Specifically, Python takes issue with this snippet:
test_data = test[['Pclass', 'Sex', 'Relatives', 'Fare', 'Age', 'Embarked', 'HasCabin']]
...
predictions = logReg.predict(test[test_data])
UPDATE
I've changed my predictions variable to this:
predictions = logReg.predict(test_data)
And now this is my stacktrace:
Traceback (most recent call last):
File "C:/Users/security/Downloads/AP/Titanic-Kaggle/TItanic-Kaggle.py", line 29, in <module>
predictions = logReg.predict(test_data)
File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\sklearn\linear_model\base.py", line 281, in predict
scores = self.decision_function(X)
File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\sklearn\linear_model\base.py", line 257, in decision_function
X = check_array(X, accept_sparse='csr')
File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\sklearn\utils\validation.py", line 573, in check_array
allow_nan=force_all_finite == 'allow-nan')
File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\sklearn\utils\validation.py", line 56, in _assert_all_finite
raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Which means that my feature selection/engineering for the test data did not work as intended.
Predictions with x_validate work without a problem. Try:
>>> predictions = logReg.predict(x_validate)
So there must be something wrong with test_data. Get some information on the dataframes and compare:
>>> x_validate.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 197 entries, 495 to 45
Data columns (total 7 columns):
Pclass 197 non-null int64
Sex 197 non-null int64
Relatives 197 non-null int64
Fare 197 non-null float64
Age 197 non-null float64
Embarked 197 non-null int64
HasCabin 197 non-null int64
dtypes: float64(2), int64(5)
memory usage: 12.3 KB
>>> test_data.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 7 columns):
Pclass 418 non-null int64
Sex 418 non-null int64
Relatives 418 non-null int64
Fare 417 non-null float64
Age 418 non-null float64
Embarked 418 non-null int64
HasCabin 418 non-null int64
dtypes: float64(2), int64(5)
memory usage: 22.9 KB
Looks like there's a NaN in test_data: Fare has only 417 non-null values out of 418 entries:
Fare 417 non-null float64
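One way to confirm and patch the gap is sketched below. It is only an illustration: the two small frames are made-up stand-ins for train_data and test_data (not the Kaggle values), and filling with the training median is an assumption on my part; any imputation strategy you prefer would do.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical stand-ins for the real frames (values are invented).
train_data = pd.DataFrame({"Pclass": [3, 1, 2, 3],
                           "Fare": [7.25, 71.28, 13.0, 8.05]})
test_data = pd.DataFrame({"Pclass": [3, 2, 3],
                          "Fare": [7.83, np.nan, 9.5]})

# Count missing values per column to see which feature holds the NaN.
missing_before = test_data.isnull().sum()

# Fill the missing fare with the median fare of the training data
# (an assumed choice; the mean or a per-class median would also work).
fare_median = train_data["Fare"].median()
test_data = test_data.assign(Fare=test_data["Fare"].fillna(fare_median))

# Verify that nothing is missing any more before calling predict().
missing_after = test_data.isnull().sum()
print(missing_before["Fare"], missing_after["Fare"])
```

assign returns a new frame, which sidesteps the SettingWithCopyWarning you may otherwise get because test_data was sliced out of test. Once isnull().sum() reports zero for every column, logReg.predict(test_data) should no longer trip the NaN check.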