Sklearn ColumnTransformer + Pipeline = TypeError

I am trying to use pipelines and column transformers from sklearn properly, but I always end up with an error. I reproduced it in the following example.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Data to reproduce the error
X = pd.DataFrame([[1,  2 , 3,  1 ],
                  [1, '?', 2,  0 ],
                  [4,  5 , 6, '?']],
                 columns=['A', 'B', 'C', 'D'])

# SimpleImputer to replace the '?' values with the mode
impute = SimpleImputer(missing_values='?', strategy='most_frequent')

# Simple one-hot encoder (scikit-learn 1.2+ renames `sparse` to `sparse_output`)
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)

col_transfo = ColumnTransformer(transformers=[
    ('missing_vals', impute, ['B', 'D']),
    ('one_hot', ohe, ['A', 'B'])])

Calling the transformer on X then raises the following error:

TypeError: Encoders require their input to be uniformly strings or numbers. Got ['int', 'str']


  • ColumnTransformer applies its transformers in parallel, not in sequence. So the OneHotEncoder sees the un-imputed column B and balks at the mixed types.

    In your case, it's probably fine to just impute on all the columns, and then encode A, B:

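    from sklearn.pipeline import Pipeline

    # The imputer returns a NumPy array, so select A and B by position
    # (0 and 1) rather than by name.
    encoder = ColumnTransformer(transformers=[
        ('one_hot', ohe, [0, 1])])
    preproc = Pipeline(steps=[
        ('impute', impute),
        ('encode', encoder),
        # optionally, just throw the model here...
    ])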

    If it's important that future missing values in A or C cause errors, then wrap impute in its own ColumnTransformer in the same way.

    See also "Apply multiple preprocessing steps to a column in sklearn pipeline".
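
For the last variant in the answer, where the imputer is wrapped in its own ColumnTransformer so that only B and D are imputed, a minimal sketch might look like the following. It is not from the original answer. The names `impute_bd` and `encode_ab` are made up for illustration, and it assumes the default NumPy output, where the first ColumnTransformer puts the imputed columns first and the passed-through ones after them (B, D, A, C). That is why the encoder selects A and B by position rather than by name.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame([[1, 2, 3, 1],
                  [1, '?', 2, 0],
                  [4, 5, 6, '?']],
                 columns=['A', 'B', 'C', 'D'])

# Step 1: impute only B and D; A and C pass through untouched.
# The result is an array whose columns are ordered B, D, A, C.
impute_bd = ColumnTransformer(
    transformers=[('missing_vals',
                   SimpleImputer(missing_values='?', strategy='most_frequent'),
                   ['B', 'D'])],
    remainder='passthrough')

# Step 2: one-hot encode A and B, which now sit at positions 2 and 0.
# C and D are dropped here because remainder defaults to 'drop'.
encode_ab = ColumnTransformer(
    transformers=[('one_hot', OneHotEncoder(handle_unknown='ignore'), [2, 0])])

# The Pipeline runs the two steps in sequence, so the encoder
# only ever sees the imputed version of B.
preproc = Pipeline(steps=[('impute', impute_bd), ('encode', encode_ab)])

encoded = preproc.fit_transform(X)
print(encoded.shape)  # (3, 4): two categories for A, two for B
```

On scikit-learn 1.2 or later, calling `set_output(transform="pandas")` on the first step is another option that should keep column names available to the second step; the positional version above avoids depending on that.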