pandas scikit-learn preprocessor one-hot-encoding

Sklearn ColumnTransformer + Pipeline = TypeError

I am trying to use sklearn's pipelines and column transformers properly, but I always end up with an error. The following example reproduces it.

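import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Data to reproduce the error
X = pd.DataFrame([[1,  2 , 3,  1 ],
                  [1, '?', 2,  0 ],
                  [4,  5 , 6, '?']],
                 columns=['A', 'B', 'C', 'D'])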

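# SimpleImputer replaces the '?' placeholders with each column's mode
impute = SimpleImputer(missing_values='?', strategy='most_frequent')

# Simple one-hot encoder
# (on scikit-learn >= 1.2 the argument is sparse_output=False; sparse was removed in 1.4)
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)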

col_transfo = ColumnTransformer(transformers=[
    ('missing_vals', impute, ['B', 'D']),
    ('one_hot', ohe, ['A', 'B'])],
    remainder='passthrough'
)

Then I call the transformer as follows:

col_transfo.fit_transform(X)

It raises the following error:

TypeError: Encoders require their input to be uniformly strings or numbers. Got ['int', 'str']

Solution

ColumnTransformer applies its transformers in parallel, not in sequence. So the OneHotEncoder sees the un-imputed column B and balks at the mixed types.
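To see the parallel behaviour in isolation, here is a minimal sketch. The toy frame `X_demo` and the step names `imputed` and `raw` are made up for illustration. The same column is routed to two branches, and each branch receives the original, un-imputed data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy column with one '?' placeholder; 2 is the unambiguous mode.
X_demo = pd.DataFrame({"B": [2, "?", 2, 5]})

demo = ColumnTransformer(transformers=[
    # Branch 1: replace '?' with the most frequent value of B.
    ("imputed", SimpleImputer(missing_values="?", strategy="most_frequent"), ["B"]),
    # Branch 2: hand B through unchanged.
    ("raw", "passthrough", ["B"]),
])

out = demo.fit_transform(X_demo)

# Column 0 comes from the imputer, column 1 from the passthrough branch.
# The passthrough column still contains '?', so that branch was fed the
# original column rather than the imputer's output.
print(out)
```

The same thing happens to the OneHotEncoder branch in the question: it is handed the raw B, including the '?' string.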

In your case, it's probably fine to just impute on all the columns, and then encode A, B:

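from sklearn.pipeline import Pipeline

# By default SimpleImputer returns a NumPy array, so the column names are
# gone by the time the encoder runs; select A and B by position instead.
encoder = ColumnTransformer(transformers=[
    ('one_hot', ohe, [0, 1])],
    remainder='passthrough'
)
preproc = Pipeline(steps=[
    ('impute', impute),
    ('encode', encoder),
    # optionally, just throw the model here...
])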

If it's important that future missing values in A or C cause errors, then similarly wrap impute in its own ColumnTransformer.
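A rough sketch of that variant, assuming the same X as in the question. A ColumnTransformer puts its transformed columns first and the passthrough remainder last, so the imputing step returns an array ordered B, D, A, C. The encoder therefore selects A and B by position (2 and 0). The name `imputer` and the try/except around the encoder's dense-output flag (renamed from sparse to sparse_output in scikit-learn 1.2) are my own additions, not part of the original code:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame([[1, 2, 3, 1],
                  [1, "?", 2, 0],
                  [4, 5, 6, "?"]],
                 columns=["A", "B", "C", "D"])

impute = SimpleImputer(missing_values="?", strategy="most_frequent")

# The dense-output flag was renamed in scikit-learn 1.2; try the new name first.
try:
    ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
except TypeError:
    ohe = OneHotEncoder(handle_unknown="ignore", sparse=False)

# Step 1: impute only B and D; A and C pass through untouched.
# Output column order is B, D (transformed) followed by A, C (remainder).
imputer = ColumnTransformer(transformers=[
    ("missing_vals", impute, ["B", "D"])],
    remainder="passthrough"
)

# Step 2: the input is now a plain array, so pick columns by position:
# A sits at index 2 and B at index 0.
encoder = ColumnTransformer(transformers=[
    ("one_hot", ohe, [2, 0])],
    remainder="passthrough"
)

preproc = Pipeline(steps=[
    ("impute", imputer),
    ("encode", encoder),
])

Xt = preproc.fit_transform(X)
print(Xt.shape)
```

Selecting by position is brittle if the column list changes. On scikit-learn 1.2 or later, asking the transformers for pandas output via set_output is one way to keep selecting by name, though I have not shown that here.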