Tags: python, pandas, logistic-regression

Pandas Logistic Regression mixed type not supported?


I'm working on making a logistic regression with a simple dataset in Python: Simple dataset with 5 rows

My goal is to predict whether or not someone survived. After cleaning the dataset and getting rid of NaN values as well as string columns, I've used the following code to turn every column's data type to float64 (cleaned dataset shown below as well): Dataset cleaned and values turned to float64

titanic_data['Survived'] = titanic_data['Survived'].astype(float)
titanic_data['Sibling/Spouse'] = titanic_data['Sibling/Spouse'].astype(float)
titanic_data['Parents/Children'] = titanic_data['Parents/Children'].astype(float)
titanic_data['male'] = titanic_data['male'].astype(float)
titanic_data['Q'] = titanic_data['Q'].astype(float)
titanic_data['S'] = titanic_data['S'].astype(float)
titanic_data[2] = titanic_data[2].astype(float)
titanic_data[3] = titanic_data[3].astype(float)

Output of the above code:

Survived            float64
Age                 float64
Sibling/Spouse      float64
Parents/Children    float64
Fare                float64
male                float64
Q                   float64
S                   float64
2                   float64
3                   float64
dtype: object

When I run my logistic regression code (shown below), I get the error mixed type of string and non-string is not supported.

My Regression code:

# Logistic regression
# Split the dataset

x = titanic_data.drop("Survived",axis=1)
y = titanic_data["Survived"]

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(x_train, y_train)

But as you can see, I've changed my column data types to all be the same, so why am I getting this error, and what can I do to fix it?

EDIT: The error message I got: Error message


Solution

  • The error you are seeing is not about the column contents but about the column names. Beware of naming columns with non-strings (e.g. 0/1/2/3 for quantile markers or one-hot-encoded levels): scikit-learn's sanity checks expect all column names to be strings. As a quick fix, convert the names to strings:

    X.columns = X.columns.astype(str)
    
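A minimal reproduction (on a made-up toy frame, not the asker's data) shows both the failure mode and the fix. Recent scikit-learn versions raise a TypeError when `fit` receives a DataFrame whose column names mix strings and non-strings (the exact wording varies by version):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy frame with mixed column names: strings plus the ints 2 and 3,
# the same shape of problem as in the question above.
X = pd.DataFrame({'age': [22.0, 38.0, 26.0, 35.0],
                  2: [1.0, 0.0, 0.0, 1.0],
                  3: [0.0, 1.0, 1.0, 0.0]})
y = pd.Series([0.0, 1.0, 1.0, 0.0])

# LogisticRegression().fit(X, y) would complain here, because X.columns
# mixes str and int names; the dtypes of the *values* are all float64
# and are not the problem.

X.columns = X.columns.astype(str)     # names become ['age', '2', '3']
clf = LogisticRegression().fit(X, y)  # now fits cleanly
```

Note that the integer names become the strings `'2'` and `'3'`, so any later code that indexes `X[2]` must be updated to `X['2']`.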

    To avoid such problems in the first place (rather than fixing them afterwards), use more canonical ways to manipulate and encode data, such as pd.get_dummies. Here is a fully working example:

    
    # Fetch Titanic
    
    from sklearn.datasets import fetch_openml
    X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
    dropped_cols = ['boat', 'body', 'home.dest', 'name', 'cabin', 'embarked', 'ticket']
    X.drop(dropped_cols, axis=1, inplace=True)
    
    # Encode (one-hot for categories) & impute (naive)
    
    import pandas as pd
    X = pd.get_dummies(X, columns=['sex', 'pclass'], drop_first=True)
    y = y.astype(float)
    X = X.fillna(0)
    
    # Logistic regression
    
    from sklearn.linear_model import LogisticRegression
    logreg = LogisticRegression()
    logreg.fit(X, y)
    logreg.score(X, y)  # 0.7868601986249045
    

    Here get_dummies did the one-hot encoding and prefixed the new column names, so they stay proper strings. X.columns now looks like this:

    Index(['age', 'sibsp', 'parch', 'fare', 'sex_male', 'pclass_2.0',
           'pclass_3.0'],
          dtype='object')
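Another of those "more canonical ways" is to let scikit-learn do the encoding and imputation itself via a Pipeline with a ColumnTransformer, which sidesteps column-name issues entirely. The sketch below uses a tiny made-up frame that merely reuses the openml titanic column names; the values are illustrative, not real passenger data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up sample with the same column names as the openml "titanic" frame
X = pd.DataFrame({
    'age':    [22.0, 38.0, None, 35.0, 28.0, 2.0],   # note the missing value
    'sibsp':  [1.0, 1.0, 0.0, 1.0, 0.0, 3.0],
    'parch':  [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    'fare':   [7.25, 71.28, 7.92, 53.1, 8.05, 21.07],
    'sex':    ['male', 'female', 'female', 'female', 'male', 'male'],
    'pclass': [3, 1, 3, 1, 3, 2],
})
y = pd.Series([0.0, 1.0, 1.0, 1.0, 0.0, 0.0])

# Numeric columns get median imputation; categorical columns get one-hot
# encoding, all inside the model itself.
preprocess = ColumnTransformer([
    ('num', SimpleImputer(strategy='median'), ['age', 'sibsp', 'parch', 'fare']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex', 'pclass']),
])
model = Pipeline([('prep', preprocess),
                  ('logreg', LogisticRegression(max_iter=1000))])
model.fit(X, y)  # no manual astype / fillna / column-name fixes needed
```

A side benefit of this design: the same fitted pipeline applies identical preprocessing to any future data passed to `model.predict`, so train/test splits stay consistent automatically.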