Transformers for mixed data types


I'm having trouble applying different transformers to columns of different types (text vs. numerical) at once, and combining those transformers into a single one for later use.

I tried to follow the steps in the documentation for Column Transformer with Mixed Types, which explains how to do that for a mix of categorical and numerical data, but it doesn't seem to work with text data.

TL;DR

How do you create a storable transformer that follows different pipelines for text and numerical data?

Data download and preparation

# imports
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

np.random.seed(0)

# download Titanic data
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

# data preparation
numeric_features = ['age', 'fare']
text_features = ['name', 'cabin', 'home.dest']
X.fillna({text_col: '' for text_col in text_features}, inplace=True)

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Transforming numerical features: ok

Following the steps in the link above, one can create a transformer for the numerical features as follows:

# handling missing data and normalization
numeric_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                                      ('scaler', StandardScaler())])

num_preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_features)])

# this works
num_preprocessor.fit(X_train)
train_feature_set = num_preprocessor.transform(X_train)
test_feature_set = num_preprocessor.transform(X_test)

# verify shape = (number of data points, number of numerical features (2) )
train_feature_set.shape  # (1047, 2)
test_feature_set.shape  # (262, 2)

Transforming text features: ok

To process text features, I vectorize each text column with TF-IDF (as opposed to concatenating all text columns, and applying TF-IDF just once):

# Tfidf of max 30 features
text_transformer = TfidfVectorizer(use_idf=True,
                                   max_features=30)
# apply separately to each column
text_transformer_list = [(x + '_vectorizer', text_transformer, x) for x in text_features]
text_preprocessor = ColumnTransformer(transformers=text_transformer_list)

# this works
text_preprocessor.fit(X_train)
train_feature_set = text_preprocessor.transform(X_train)
test_feature_set = text_preprocessor.transform(X_test)

# verify shape = (number of data points, number of text features (3) times max_features(30) )
train_feature_set.shape  # (1047, 90)
test_feature_set.shape  # (262, 90)

How do you do both at once?

I've tried several strategies to combine the two procedures above into a single transformer, but each one fails with a different error.

Attempt 1: Follow documented strategy

Following the documentation (Column Transformer with Mixed Types) doesn't work once text data replaces categorical data:

# documented strategy
sum_preprocessor = ColumnTransformer(transformers=[('num', numeric_transformer, numeric_features),
                                                   ('text', text_transformer, text_features)])
# fails
sum_preprocessor.fit(X_train)

This returns the following error message:

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 1047 and the array at index 1 has size 3

Attempt 2: FeatureUnion on the lists of transformers

# create a list of numerical transformers, like the one for text
numerical_transformer_list = [(x + '_scaler', numeric_transformer, x) for x in numeric_features]

# fails
column_trans = FeatureUnion([text_transformer_list, numerical_transformer_list])

This returns the following error message:

TypeError: All estimators should implement fit and transform. '('cabin_vectorizer', TfidfVectorizer(max_features=30), 'cabin')' (type <class 'tuple'>) doesn't

Attempt 3: ColumnTransformer on the lists of transformers

# create a list of all transformers, text and numerical
sum_transformer_list = text_transformer_list + numerical_transformer_list

# works
sum_preprocessor = ColumnTransformer(transformers=sum_transformer_list)

# fails
sum_preprocessor.fit(X_train)

This returns the following error message:

ValueError: Expected 2D array, got 1D array instead:
array=[54. nan nan ... 20. nan nan].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

My question

How do I create a single object that can fit and transform data mixing text and numerical types?


Solution

  • Short answer:

    all_transformers = text_transformer_list + [('num', numeric_transformer, numeric_features)]
    
    all_preprocessor = ColumnTransformer(transformers=all_transformers)
    
    all_preprocessor.fit(X_train)
    train_all = all_preprocessor.transform(X_train)
    test_all = all_preprocessor.transform(X_test)
    
    print(train_all.shape, test_all.shape)
    # prints (1047, 92) (262, 92)
    

    The difficulty here is that text vectorizers such as TfidfVectorizer expect 1-dimensional input (an iterable of strings), while numerical transformers such as SimpleImputer and StandardScaler expect 2-dimensional input. ColumnTransformer handles that by letting you specify either a single column name or a list of columns: in the first case a 1d array is passed to the transformer, and in the second a 2d array is passed.
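One way to see the 1d/2d distinction for yourself is a small probe. This is only a sketch on a toy DataFrame: `ShapeRecorder` is a hypothetical helper written for this illustration (it is not part of scikit-learn), and the column values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer


class ShapeRecorder(BaseEstimator, TransformerMixin):
    """Hypothetical helper: remembers how many dimensions its input had."""

    def fit(self, X, y=None):
        self.seen_ndim_ = np.asarray(X).ndim
        return self

    def transform(self, X):
        # ColumnTransformer requires 2d output, so reshape whatever came in
        return np.asarray(X, dtype=float).reshape(len(X), -1)


# toy data (made up for the illustration)
toy = pd.DataFrame({"age": [22.0, 38.0, 26.0], "fare": [7.25, 71.28, 7.92]})

probe = ColumnTransformer(transformers=[
    ("scalar_spec", ShapeRecorder(), "age"),   # single column name
    ("list_spec", ShapeRecorder(), ["age"]),   # one-element list of columns
])
probe.fit(toy)

ndim_scalar = probe.named_transformers_["scalar_spec"].seen_ndim_
ndim_list = probe.named_transformers_["list_spec"].seen_ndim_
print(ndim_scalar, ndim_list)
```

The transformer given the bare column name should report one dimension, and the one given the list should report two.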

    So, to explain the errors in the three attempts:

    Attempt 1: The TF-IDF vectorizer receives a 2d input and treats each of the three columns, rather than each individual entry, as a document, so it produces just three output rows. When ColumnTransformer tries to concatenate those with the 1047-row numerical output, it fails.
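The Attempt 1 failure can be reproduced in isolation, without ColumnTransformer. This sketch uses a made-up two-column DataFrame and assumes the vectorizer simply iterates over whatever it is given; iterating over a DataFrame yields its column labels, which would explain the one-row-per-column result.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# toy data (made up for the illustration)
toy = pd.DataFrame({
    "name": ["Allen, Miss. Elisabeth", "Allison, Master. Hudson", "Andrews, Mr. Thomas"],
    "cabin": ["B5", "C22 C26", "A36"],
})

# 2d input: iterating over a DataFrame yields its column labels,
# so the vectorizer sees one "document" per column
two_d = TfidfVectorizer().fit_transform(toy[["name", "cabin"]])

# 1d input: one document per row, as intended
one_d = TfidfVectorizer().fit_transform(toy["name"])

print(two_d.shape[0], one_d.shape[0])
```

With two columns and three rows, the 2d call gives two output rows and the 1d call gives three, mirroring the 3-versus-1047 mismatch in the error message.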

    Attempt 2: FeatureUnion doesn't have the same input format as ColumnTransformer: it expects (name, transformer) pairs, not (name, transformer, columns) triples. In any case, FeatureUnion applies every transformer to the whole input, so it isn't meant for the per-column processing you're doing here.
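For contrast, here is a sketch of the input format FeatureUnion does accept. The array values and the choice of two scalers are arbitrary; the point is only that there is no column selector and each transformer sees every column.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# toy numerical data (made up for the illustration): 4 rows, 2 columns
X_num = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.92], [35.0, 53.10]])

# FeatureUnion takes (name, transformer) pairs; there is no column selector
union = FeatureUnion([
    ("standard", StandardScaler()),
    ("minmax", MinMaxScaler()),
])

# every transformer sees all columns, and the outputs are stacked side by side
stacked = union.fit_transform(X_num)
print(stacked.shape)
```

The result has four columns, two from each scaler, which is why FeatureUnion suits "several views of the same input" rather than "different columns to different transformers".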

    Attempt 3: This time you're sending 1d data to the numerical transformers, which expect 2d data. Passing each numerical column as a one-element list ([x] instead of x) would give them 2d input.
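As a sketch of how Attempt 3 could be repaired while keeping one transformer per column, the example below uses a small made-up DataFrame in place of the Titanic data, so the column values and the resulting shapes are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# toy data (made up for the illustration)
toy = pd.DataFrame({
    "name": ["Allen, Miss. Elisabeth", "Allison, Master. Hudson",
             "Andrews, Mr. Thomas", "Astor, Col. John"],
    "age": [29.0, np.nan, 39.0, 47.0],
    "fare": [211.34, 151.55, 0.0, 227.53],
})

numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="median")),
                                      ("scaler", StandardScaler())])

# Attempt 3 as written: a bare column name sends 1d data to the pipeline
broken = ColumnTransformer(transformers=[("age_scaler", numeric_transformer, "age")])
try:
    broken.fit(toy)
    raised = False
except ValueError:
    raised = True

# Variant: wrap each numerical column in a list so the pipeline gets 2d data,
# while the text column stays a bare name so TF-IDF gets 1d data
fixed = ColumnTransformer(transformers=[
    ("name_vectorizer", TfidfVectorizer(), "name"),
    ("age_scaler", numeric_transformer, ["age"]),
    ("fare_scaler", numeric_transformer, ["fare"]),
])
features = fixed.fit_transform(toy)

n_text = len(fixed.named_transformers_["name_vectorizer"].vocabulary_)
print(raised, features.shape, n_text)
```

The output should have one row per passenger and one column per TF-IDF term plus one per numerical column. Scaling each numerical column separately gives the same result as passing them together here, since both the imputer and the scaler work column by column, so the single ('num', numeric_transformer, numeric_features) entry from the short answer is the more compact choice.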