python, scikit-learn, pipeline, sparse-matrix, countvectorizer

Using CountVectorizer with Pipeline and ColumnTransformer and getting AttributeError: 'numpy.ndarray' object has no attribute 'lower'


I'm trying to use CountVectorizer() with Pipeline and ColumnTransformer. Because CountVectorizer() produces a sparse matrix, I used FunctionTransformer to make its output dense so that the ColumnTransformer can hstack the resulting matrices correctly.

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from typing import Callable

# Dataset
df = pd.DataFrame([['a', 'Hi Tom', 'It is hot', 1],
                    ['b', 'How you been Tom', 'hot coffee', 2],
                    ['c', 'Hi you', 'I want some coffee', 3]],
                   columns=['col_for_ohe', 'col_for_countvectorizer_1', 'col_for_countvectorizer_2', 'num_col'])

# Use FunctionTransformer to ensure dense matrix
def tf_text(X, vectorizer_tf: Callable):
    X_vect_ = vectorizer_tf.fit_transform(X)
    return X_vect_.toarray()

tf_transformer = FunctionTransformer(tf_text, kw_args={'vectorizer_tf': CountVectorizer()})

# Transformation Pipelines
tf_transformer_pipe = Pipeline(
    steps = [('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
             ('tf', tf_transformer)])

ohe_transformer_pipe = Pipeline(
    steps = [('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
             ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False))])

transformer = ColumnTransformer(transformers=[
    ('cat_ohe', ohe_transformer_pipe, ['col_for_ohe']),
    ('cat_tf', tf_transformer_pipe, ['col_for_countvectorizer_1', 'col_for_countvectorizer_2'])
], remainder='passthrough')

transformed_df = transformer.fit_transform(df)

I get AttributeError: 'numpy.ndarray' object has no attribute 'lower'. I've seen this question and suspect CountVectorizer() is the culprit, but I'm not sure how to solve it (the previous question doesn't use ColumnTransformer). I stumbled upon a DenseTransformer that I wish I could use instead of FunctionTransformer, but unfortunately it is not supported at my company.
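For what it's worth, the same error can be reproduced outside the pipeline by feeding CountVectorizer() a 2D array instead of a 1D sequence of strings, which I believe is what comes out of SimpleImputer inside the pipeline:

# CountVectorizer expects an iterable of strings; with a 2D array each "document"
# is itself a numpy array, so the default preprocessing step's .lower() call fails
X_2d = df[['col_for_countvectorizer_1', 'col_for_countvectorizer_2']].to_numpy()
CountVectorizer().fit_transform(X_2d)
# AttributeError: 'numpy.ndarray' object has no attribute 'lower'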


Solution

  • Imo, the first thing to consider is that CountVectorizer() requires 1D input; your example is not working because the imputation returns a 2D numpy array, which means you'll need some custom handling to make it work.

    Then you should also consider that, when using a CountVectorizer() instance (which, again, requires 1D input) as a transformer in a ColumnTransformer(), this is how the transformers' columns should be passed, per the documentation:

    columns: str, array-like of str, int, array-like of int, array-like of bool, slice or callable

    Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. [...]
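    To make the difference concrete, here is a quick check built on the df above: with a scalar column name, CountVectorizer() receives a 1D Series and works, whereas with a list it would receive a one-column 2D DataFrame, which is not what it expects:

    # scalar column name -> a 1D Series reaches CountVectorizer, which works fine
    ColumnTransformer([('tf', CountVectorizer(), 'col_for_countvectorizer_1')]).fit_transform(df)

    # ['col_for_countvectorizer_1'] would instead pass a 2D, one-column DataFrame,
    # which CountVectorizer does not treat as a sequence of documents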

    This would be useful in interpreting the snippet I'll post as a possible solution.

    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
    from sklearn.compose import ColumnTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import Pipeline
    from typing import Callable
    from sklearn.base import BaseEstimator, TransformerMixin
    
    # Dataset
    df = pd.DataFrame([['a', 'Hi Tom', 'It is hot', 1],
                    ['b', 'How you been Tom', 'hot coffee', 2],
                    ['c', 'Hi you', 'I want some coffee', 3]],
                   columns=['col_for_ohe', 'col_for_countvectorizer_1', 'col_for_countvectorizer_2', 'num_col'])
    
    # Simple helper that wraps the imputer's ndarray output back into a DataFrame
    class DimTransformer(BaseEstimator, TransformerMixin):
        def __init__(self):
            pass
        def fit(self, *_):
            return self
        def transform(self, X, *_):
            return pd.DataFrame(X)
    
    # Use FunctionTransformer to ensure dense matrix
    def tf_text(X, vectorizer_tf: Callable):
        X_vect_ = vectorizer_tf.fit_transform(X)
        return X_vect_.toarray()
    
    tf_transformer = FunctionTransformer(tf_text, kw_args={'vectorizer_tf': CountVectorizer()})
    
    # Transformation Pipelines
    tf_transformer_pipe = Pipeline(
        steps = [('imputer', SimpleImputer(strategy='constant', fill_value='missing')), 
                 ('dt', DimTransformer()),
                 ('ct', ColumnTransformer([
                     ('tf1', tf_transformer, 0), 
                     ('tf2', tf_transformer, 1)
                 ]))    
    ])
    
    ohe_transformer_pipe = Pipeline(
        steps = [('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
                 ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False))])
    
    transformer = ColumnTransformer(transformers=[
        ('cat_ohe', ohe_transformer_pipe, ['col_for_ohe']),
        ('cat_tf', tf_transformer_pipe, ['col_for_countvectorizer_1', 'col_for_countvectorizer_2'])
    ], remainder='passthrough')
    
    transformed_df = transformer.fit_transform(df)
    


    Namely, I'm adding a transformer that simply wraps the array returned by the SimpleImputer instance back into a DataFrame. Then, and most importantly, since it does not seem possible to apply the vectorization to the 2D input coming out of the previous two steps ('imputer' and 'dt'), I'm adding a further ColumnTransformer which splits the vectorization into two parallel steps (one vectorization per column). Notice that at this point the columns are referenced positionally, as the column names may have changed. Of course, this is a custom solution, but it may at least provide some hints.
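    If you later need the fitted vectorizers (e.g. their vocabularies), they should be reachable by drilling into the fitted objects; a sketch, relying only on the standard named_transformers_ / named_steps attributes:

    # outer ColumnTransformer -> 'cat_tf' Pipeline -> inner ColumnTransformer -> FunctionTransformer
    inner_ct = transformer.named_transformers_['cat_tf'].named_steps['ct']
    print(inner_ct.named_transformers_['tf1'].kw_args['vectorizer_tf'].vocabulary_)
    print(inner_ct.named_transformers_['tf2'].kw_args['vectorizer_tf'].vocabulary_)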

    Given that you don't actually have missing values, you can verify that it works by comparing it with the output of:

    dt = DimTransformer().fit_transform(df)
    ct = ColumnTransformer([
        ('tf1', tf_transformer, 1), 
        ('tf2', tf_transformer, 2)
    ])
    ct.fit_transform(dt)
    

    print(ct.named_transformers_['tf1'].kw_args['vectorizer_tf'].vocabulary_)
    print(ct.named_transformers_['tf2'].kw_args['vectorizer_tf'].vocabulary_)

    and by noticing that the fourth through the second-to-last columns of transformed_df (namely those produced by 'cat_tf') coincide with the columns returned here by ct.fit_transform(dt).


    Finally, here are a couple of posts that focus on using CountVectorizer within a ColumnTransformer instance, though they do not consider imputing the dataset beforehand.
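    For reference, without the need to impute, the simpler pattern would be to pass each text column by scalar name directly to its own CountVectorizer() in the top-level ColumnTransformer; a sketch based on your df:

    simple_transformer = ColumnTransformer(transformers=[
        ('cat_ohe', OneHotEncoder(handle_unknown='ignore', sparse=False), ['col_for_ohe']),
        ('tf1', CountVectorizer(), 'col_for_countvectorizer_1'),
        ('tf2', CountVectorizer(), 'col_for_countvectorizer_2')
    ], remainder='passthrough')

    simple_transformer.fit_transform(df)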