Tags: python, machine-learning, scikit-learn, one-hot-encoding

Process different columns with different preprocessing steps


I have the following DataFrame, df:

         text     count     daytime        label
   I think...        4      morning          pos
You should...        3    afternoon          neg
    Better...        7      evening          neu

I first tried to preprocess only the text column with ColumnTransformer:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer

transformer = ColumnTransformer([
    ('vectorizer', TfidfVectorizer(ngram_range=(1, 1)), 'text')
], remainder='passthrough')

It worked fine. Then I wanted to also scale count and one-hot encode daytime, using the following code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

transformer = ColumnTransformer([
    ('vectorizer', TfidfVectorizer(ngram_range=(1, 1)), 'text'),
    ('scaler', StandardScaler(), 'count'),
    ('enc', OneHotEncoder(), 'daytime')
], remainder='passthrough')

X_transformed = transformer.fit_transform(X)

It gave me this error:

1D data passed to a transformer that expects 2D data. Try to specify the column selection as a list of one item instead of a scalar.

I think the problem is with StandardScaler, which is only being passed 1D data. How do I solve this?


Solution

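  • Each tuple in the list of tuples must be separated by a comma. Beyond that, StandardScaler and OneHotEncoder expect 2D input, so, as the error message suggests, pass their column selectors as one-item lists. A scalar selector such as 'count' hands the transformer a 1D column, while ['count'] hands it a 2D, single-column frame. TfidfVectorizer keeps the scalar 'text' because it expects a 1D sequence of strings: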

    transformer = ColumnTransformer([
        ('vectorizer', TfidfVectorizer(ngram_range=(1, 1)), 'text'), 
        ('scaler', StandardScaler(), ['count']),  
        ('enc', OneHotEncoder(), ['daytime'])
    ], remainder='passthrough')
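
To see the scalar-versus-list difference end to end, here is a minimal sketch on a toy frame shaped like the one in the question. The text values are made up for illustration (the question elides them), and X is assumed to hold only the feature columns, with label kept separately as the target.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy features mirroring the question; the text strings are hypothetical.
X = pd.DataFrame({
    "text": ["I think it is good", "You should not go", "Better than before"],
    "count": [4, 3, 7],
    "daytime": ["morning", "afternoon", "evening"],
})

# A scalar selector yields a 1D Series; a one-item list yields a 2D frame.
shape_scalar = X["count"].shape    # (3,)
shape_list = X[["count"]].shape    # (3, 1)

transformer = ColumnTransformer([
    # TfidfVectorizer wants a 1D sequence of strings, so 'text' stays scalar.
    ("vectorizer", TfidfVectorizer(ngram_range=(1, 1)), "text"),
    # StandardScaler and OneHotEncoder want 2D input, so use one-item lists.
    ("scaler", StandardScaler(), ["count"]),
    ("enc", OneHotEncoder(), ["daytime"]),
], remainder="passthrough")

X_transformed = transformer.fit_transform(X)

# Output width = TF-IDF vocabulary size + 1 scaled column + 3 one-hot columns.
vocab_size = len(transformer.named_transformers_["vectorizer"].vocabulary_)
print(shape_scalar, shape_list, X_transformed.shape, vocab_size)
```

With all three columns assigned to a transformer, remainder='passthrough' has nothing left to pass through here. It only matters if X carries extra columns.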