
ValueError: Input contains NaN, even when using SimpleImputer


I'm working on the Titanic dataset as my first Kaggle project and I ran into this error. I kept searching for a solution here on Stack Overflow, but I still can't figure it out.

I made two pipelines to preprocess the numerical and categorical features:

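from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numeric columns: fill missing values with the median, then standardize
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Categorical columns: fill missing values with a placeholder, then one-hot encode
cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder())])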

Then I joined them in a ColumnTransformer:

from sklearn.compose import ColumnTransformer

# Apply each pipeline to its own group of columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_pipeline, numeric_features),
        ('cat', cat_pipeline, categorical_features)])

Here numeric_features and categorical_features are the lists of numerical and categorical column names:

numeric_features = ['Age', 'SibSp', 'Parch', 'Fare']
categorical_features = ['Pclass', 'Sex', 'Embarked']

Finally, I add a classifier in the final pipeline:

from sklearn.neighbors import KNeighborsClassifier

# Preprocessing followed by the classifier
knn = Pipeline([
    ('Preprocessor', preprocessor),
    ('Classifier', KNeighborsClassifier())
])
knn.fit(X_train, y_train)

This is where I get "ValueError: Input contains NaN".
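
One way to narrow this down is to run the preprocessing step on its own and check whether its output still contains NaN. The following is a minimal sketch, not a diagnosis of the actual data: `X_demo` and its values are made up to stand in for `X_train`, and the pipelines are rebuilt so the snippet runs by itself.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ['Age', 'SibSp', 'Parch', 'Fare']
categorical_features = ['Pclass', 'Sex', 'Embarked']

# Made-up rows standing in for X_train (illustrative values only)
X_demo = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, 54.0],
    'SibSp': [1, 0, 1, 0],
    'Parch': [0, 0, 2, 0],
    'Fare': [7.25, 71.28, 8.05, 51.86],
    'Pclass': [3, 1, 3, 1],
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', np.nan, 'S'],
})

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_pipeline, numeric_features),
        ('cat', cat_pipeline, categorical_features)])

# Run only the preprocessing step and inspect what the classifier would receive
Xt = preprocessor.fit_transform(X_demo)
if hasattr(Xt, 'toarray'):
    # The one-hot output can make the result sparse; densify before checking
    Xt = Xt.toarray()

has_nan = bool(np.isnan(Xt).any())
print(has_nan)  # False for this demo frame
```

Running the same check with the real `X_train` in place of `X_demo` shows whether NaN values survive preprocessing or come from somewhere else, such as `y_train`.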


Solution

    import pandas as pd

    train = pd.read_csv('train.csv')
    train.isna().sum()
    

    Output:

    PassengerId      0
    Survived         0
    Pclass           0
    Name             0
    Sex              0
    Age            177
    SibSp            0
    Parch            0
    Ticket           0
    Fare             0
    Cabin          687
    Embarked         2
    dtype: int64
    

    The columns Age, Cabin and Embarked contain NaN values. However, Cabin is not included in numeric_features or categorical_features, so its values are never imputed. This is why you get the error.
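
The comparison above can also be done programmatically. The following is a minimal sketch that lists the columns containing NaN that neither feature list covers. `train_demo` is a made-up frame standing in for `train.csv`, and the names `covered` and `uncovered_with_nan` are illustrative.

```python
import numpy as np
import pandas as pd

numeric_features = ['Age', 'SibSp', 'Parch', 'Fare']
categorical_features = ['Pclass', 'Sex', 'Embarked']

# Made-up rows standing in for train.csv (illustrative values only)
train_demo = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0],
    'SibSp': [1, 0, 1],
    'Parch': [0, 0, 2],
    'Fare': [7.25, 71.28, 8.05],
    'Pclass': [3, 1, 3],
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', np.nan, 'S'],
    'Cabin': [np.nan, 'C85', np.nan],
})

# Columns that one of the two imputing pipelines will handle
covered = set(numeric_features) | set(categorical_features)

# Columns that contain NaN but are in neither feature list
nan_counts = train_demo.isna().sum()
uncovered_with_nan = [
    col for col in train_demo.columns
    if nan_counts[col] > 0 and col not in covered
]
print(uncovered_with_nan)  # ['Cabin']
```

Whether such a column actually reaches the classifier depends on the ColumnTransformer's `remainder` setting: with the default `remainder='drop'`, unlisted columns are discarded, while `remainder='passthrough'` forwards them unchanged, NaN included.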