Tags: python, machine-learning, scikit-learn, imputation, countvectorizer

How to include SimpleImputer before CountVectorizer in a scikit-learn Pipeline?


I have a pandas DataFrame that includes a column of text, and I would like to vectorize the text using scikit-learn's CountVectorizer. However, the text includes missing values, and so I would like to impute a constant value before vectorizing.

My initial idea was to create a Pipeline of SimpleImputer and CountVectorizer:

import pandas as pd
import numpy as np
df = pd.DataFrame({'text':['abc def', 'abc ghi', np.nan]})

from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='constant')

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

from sklearn.pipeline import make_pipeline
pipe = make_pipeline(imp, vect)

pipe.fit_transform(df[['text']]).toarray()

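However, fit_transform raises an error because SimpleImputer outputs a 2D array, whereas CountVectorizer expects a 1D iterable of strings. Each row of the 2D array reaches the vectorizer as a NumPy array rather than a string, which produces this error message: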

AttributeError: 'numpy.ndarray' object has no attribute 'lower'
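
To see where the error comes from, it can help to inspect what SimpleImputer returns on its own. The following is a minimal sketch using the same DataFrame as above; the variable names `imputed` and `flat` are only for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]})

# Impute the single text column; the result is a 2D array with one column
imputed = SimpleImputer(strategy='constant').fit_transform(df[['text']])

print(imputed.shape)     # (3, 1)
print(type(imputed[0]))  # each row is itself a numpy.ndarray, not a str

# Flattening yields the 1D sequence of strings that CountVectorizer expects
flat = imputed.ravel()
print(flat.shape)        # (3,)
```

CountVectorizer iterates over its input and calls `.lower()` on each item, so it fails as soon as it receives a row of the 2D array instead of a string.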

QUESTION: How can I modify this Pipeline so that it will work?

NOTE: I'm aware that I can impute missing values in pandas. However, I would like to accomplish all preprocessing in scikit-learn so that the same preprocessing can be applied to new data using Pipeline.


Solution

  • The best solution I have found is to insert a custom transformer into the Pipeline that reshapes the output of SimpleImputer from 2D to 1D before it is passed to CountVectorizer.

    Here's the complete code:

    import pandas as pd
    import numpy as np
    df = pd.DataFrame({'text':['abc def', 'abc ghi', np.nan]})
    
    from sklearn.impute import SimpleImputer
    imp = SimpleImputer(strategy='constant')
    
    from sklearn.feature_extraction.text import CountVectorizer
    vect = CountVectorizer()
    
    # CREATE TRANSFORMER
    from sklearn.preprocessing import FunctionTransformer
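    # np.ravel flattens the (n_samples, 1) array to 1D; it avoids np.reshape's
    # `newshape` keyword, which is deprecated in recent NumPy releases
    one_dim = FunctionTransformer(np.ravel)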
    
    # INCLUDE TRANSFORMER IN PIPELINE
    from sklearn.pipeline import make_pipeline
    pipe = make_pipeline(imp, one_dim, vect)
    
    pipe.fit_transform(df[['text']]).toarray()
    

    It has been proposed on GitHub that CountVectorizer should allow 2D input as long as the second dimension is 1 (meaning: a single column of data). That modification to CountVectorizer would be a great solution to this problem!