python, machine-learning, scikit-learn, calculated-columns

ColumnTransformer produces different results


I'm going through the Jupyter notebooks from the book Hands-On ML with Scikit-Learn. I'm trying to do the Titanic challenge, but using ColumnTransformer.

I'm trying to create the pre-processing pipeline. For the numerical attributes, ColumnTransformer produces the right output; however, when working with the categorical attributes I get a weird output.

Here's the code:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

cat_attr = ['Sex', 'Embarked', 'Pclass']

cat_pipeline = ColumnTransformer([
        ('cat_fill_missing', SimpleImputer(strategy='most_frequent'), cat_attr),
        ('cat_encoder', OneHotEncoder(sparse=False), cat_attr),
        ])

cat_pipeline.fit_transform(train_data)

This produces:

array([['male', 'S', 3, ..., 0.0, 0.0, 1.0],
       ['female', 'C', 1, ..., 1.0, 0.0, 0.0],
       ['female', 'S', 3, ..., 0.0, 0.0, 1.0],
       ...,
       ['female', 'S', 3, ..., 0.0, 0.0, 1.0],
       ['male', 'C', 1, ..., 1.0, 0.0, 0.0],
       ['male', 'Q', 3, ..., 0.0, 0.0, 1.0]], dtype=object)

However, if I run the SimpleImputer and OneHotEncoder one by one:

imputer = SimpleImputer(strategy='most_frequent')

filled_df = imputer.fit_transform(train_data[cat_attr])

onehot = OneHotEncoder(sparse=False)

onehot.fit_transform(filled_df)

I get the correct encoding:

array([[0., 1., 0., ..., 0., 0., 1.],
       [1., 0., 1., ..., 1., 0., 0.],
       [1., 0., 0., ..., 0., 0., 1.],
       ...,
       [1., 0., 0., ..., 0., 0., 1.],
       [0., 1., 1., ..., 1., 0., 0.],
       [0., 1., 0., ..., 0., 0., 1.]])

What's the reason behind this behaviour? I thought ColumnTransformer modified each column one by one.


Solution

  • Expanding on the comment: the columns you give as the third tuple element to ColumnTransformer should partition the entire set of columns in your dataframe, because ColumnTransformer fits each transformer on its listed columns of the raw input independently and then concatenates the outputs side by side.

    If the same columns are listed for several transformers, as in your code, the output contains both the imputed (still unencoded) strings from SimpleImputer and the one-hot columns from OneHotEncoder, which is exactly the mixed array you are seeing (see the first sketch at the end of this answer). If some columns are omitted, they are left out of the output of ColumnTransformer.

    For example, say that your dataframe has categorical columns cat_attr and numeric columns num_attr. You want to apply two transformations (SimpleImputer and OneHotEncoder) to the categorical columns and no transformation to the numeric columns. In this case the correct approach is:

    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder

    transformer = ColumnTransformer(transformers=[
        ('categorical', make_pipeline(
            SimpleImputer(strategy='most_frequent'),
            OneHotEncoder(sparse=False)
        ), cat_attr),
        ('numerical', 'passthrough', num_attr)
    ])
    

    The idea is to build one pipeline for all the transformations that act on the same set of columns, and to use the special string 'passthrough' to indicate that another set of columns should be passed through untransformed. The two toy sketches below illustrate the failure mode and the fix.
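
    Here is a minimal sketch of the failure mode on a made-up toy DataFrame (column names borrowed from the question): because each transformer is fitted on the original input and the results are stacked, listing the same columns twice gives you the raw imputed strings next to the one-hot columns.

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder

    # Toy data reusing the question's column names (values made up)
    toy = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                        'Embarked': ['S', 'C', 'S']})

    ct = ColumnTransformer([
        ('cat_fill_missing', SimpleImputer(strategy='most_frequent'), ['Sex', 'Embarked']),
        ('cat_encoder', OneHotEncoder(sparse=False), ['Sex', 'Embarked']),
    ])

    # First two output columns: the strings returned by SimpleImputer.
    # Remaining columns: the one-hot encoding, computed from the *original*
    # columns -- the encoder never sees the imputer's output.
    print(ct.fit_transform(toy))
    # -> something like:
    # [['male' 'S' 0.0 1.0 0.0 1.0]
    #  ['female' 'C' 1.0 0.0 1.0 0.0]
    #  ['female' 'S' 1.0 0.0 0.0 1.0]]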
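
    And a runnable sketch of the fix on similar toy data (the numeric column Age is assumed here), checking that the pipeline-based ColumnTransformer reproduces the manual SimpleImputer + OneHotEncoder result from the question. Note that on recent scikit-learn releases the sparse argument of OneHotEncoder is called sparse_output.

    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder

    toy = pd.DataFrame({
        'Sex': ['male', 'female', np.nan, 'male'],
        'Embarked': ['S', 'C', 'S', np.nan],
        'Pclass': [3, 1, 3, 2],
        'Age': [22.0, 38.0, 26.0, 35.0],   # stand-in for the numeric attributes
    })
    cat_attr = ['Sex', 'Embarked', 'Pclass']
    num_attr = ['Age']

    transformer = ColumnTransformer(transformers=[
        ('categorical', make_pipeline(
            SimpleImputer(strategy='most_frequent'),
            OneHotEncoder(sparse=False)        # sparse_output=False on newer versions
        ), cat_attr),
        ('numerical', 'passthrough', num_attr)
    ])
    combined = transformer.fit_transform(toy)

    # Manual two-step approach from the question, on the categorical columns only
    manual = OneHotEncoder(sparse=False).fit_transform(
        SimpleImputer(strategy='most_frequent').fit_transform(toy[cat_attr]))

    # The encoded categorical columns come first, followed by the untouched numeric ones
    assert np.allclose(combined[:, :manual.shape[1]].astype(float), manual)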