python machine-learning scikit-learn pipeline feature-selection

Pipeline using multiple columns

I have a binary classification problem. My dataset consists of columns of different types: binary (0 or 1) or textual (text from emails). I have more than 40 columns.

An example may be the following:

Text                             is_it_capital?     is_it_upper?      contains_num?   Label
an example of text                      0                  0               0            0
ANOTHER example of text                 1                  1               0            1
What's happening?Let's talk at 5        1                  0               1            1

I am trying to use a Pipeline to make the prediction. However, because I have already encoded some columns (is_it_capital?, ...), I do not know how to add these columns (features) to my pipeline. All of them are numerical and take values of either 1 or 0 (checked using numerical_columns = train_set.select_dtypes(include=[np.number])).

If I had not already encoded those columns, FeatureUnion would probably have been a good solution; in this case, I have no idea how to proceed.

I have tried as follows

    nb_pipeline = Pipeline([
        ('NBCV', extract_func.tf_idf_n),
        ('nb_clf', MultinomialNB())])
    nb_pipeline.fit(train_set, train_set['Label'])  # I am considering the whole training set
    predicted_nb = nb_pipeline.predict(test_set)
    np.mean(predicted_nb == test_set['Label'])

but I got the error

ValueError: Found input variables with inconsistent numbers of samples: [30, 4394]

I am splitting the dataset into train (80%) and test (20%) using train_test_split. y is only Label, while X contains all the other columns in my example. After splitting the dataset, I concatenate X_train and y_train as follows:

train_set = pd.concat([X_train, y_train], axis=1)
test_set = pd.concat([X_test, y_test], axis=1)

Full traceback of the error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-50-bab0cc0a9f07> in <module>
      6         ('nb_clf',MultinomialNB())])
      7 
----> 8 nb_pipeline.fit(train_set.drop('Label', axis=1), train_set['Label'])
      9 predicted_nb = nb_pipeline.predict(test_set.drop('Label', axis=1))
     10 np.mean(predicted_nb == test_set['Label'])

/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    333             if self._final_estimator != 'passthrough':
    334                 fit_params_last_step = fit_params_steps[self.steps[-1][0]]
--> 335                 self._final_estimator.fit(Xt, y, **fit_params_last_step)
    336 
    337         return self

/anaconda3/lib/python3.7/site-packages/sklearn/naive_bayes.py in fit(self, X, y, sample_weight)
    613         self : object
    614         """
--> 615         X, y = self._check_X_y(X, y)
    616         _, n_features = X.shape
    617         self.n_features_ = n_features

/anaconda3/lib/python3.7/site-packages/sklearn/naive_bayes.py in _check_X_y(self, X, y)
    478 
    479     def _check_X_y(self, X, y):
--> 480         return self._validate_data(X, y, accept_sparse='csr')
    481 
    482     def _update_class_log_prior(self, class_prior=None):

/anaconda3/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    430                 y = check_array(y, **check_y_params)
    431             else:
--> 432                 X, y = check_X_y(X, y, **check_params)
    433             out = X, y
    434 

/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    810         y = y.astype(np.float64)
    811 
--> 812     check_consistent_length(X, y)
    813 
    814     return X, y

/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
    254     if len(uniques) > 1:
    255         raise ValueError("Found input variables with inconsistent numbers of"
--> 256                          " samples: %r" % [int(l) for l in lengths])
    257 
    258 

ValueError: Found input variables with inconsistent numbers of samples: [29, 4394]

Solution

From the traceback, you can see that the TF-IDF step completes and the NB model is what breaks. I suspect the vectorizer is not doing what you expect: it iterates over whatever it is given, and iterating over a DataFrame yields its column labels rather than its rows. So it sees only 29 "documents" (one per column), and the NB model then receives 29 training rows alongside 4394 labels.
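To see this behaviour in isolation, here is a small sketch on made-up data. It assumes extract_func.tf_idf_n behaves like a plain scikit-learn TfidfVectorizer, which the question does not confirm; the column names and rows are illustrative only.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy frame (made-up data): 3 rows, 4 feature columns.
df = pd.DataFrame({
    "Text": ["an example of text", "ANOTHER example of text", "Let's talk at 5"],
    "is_it_capital": [0, 1, 1],
    "is_it_upper": [0, 1, 0],
    "contains_num": [0, 0, 1],
})

# Iterating over a DataFrame yields its column labels, not its rows.
print(list(df))

# Given the whole frame, the vectorizer treats each column label as one "document".
wrong = TfidfVectorizer().fit_transform(df)
print(wrong.shape[0])  # one row per column

# Given only the text column, it produces one row per sample.
right = TfidfVectorizer().fit_transform(df["Text"])
print(right.shape[0])  # one row per sample
```

With the whole frame the output has as many rows as there are columns, which is the same kind of mismatch as the 29 versus 4394 in the error message.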

I believe something like the following should work the way you want it to.

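from sklearn.compose import ColumnTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Vectorize only the 'Text' column; pass the 0/1 columns through unchanged.
# 'Text' must be a plain string (not ['Text']) so the vectorizer gets a 1-D column of documents.
ct = ColumnTransformer(
    transformers=[('tfidf', extract_func.tf_idf_n, 'Text')],
    remainder='passthrough',
)
nb_pipeline = Pipeline([
    ('preproc', ct),
    ('nb_clf', MultinomialNB()),
])
nb_pipeline.fit(train_set.drop('Label', axis=1), train_set['Label'])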
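For a version you can run without your own data, here is a minimal end-to-end sketch. It substitutes a default TfidfVectorizer for extract_func.tf_idf_n (an assumption, since that object is not shown in the question) and uses a tiny made-up frame shaped like your example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up data mirroring the layout in the question.
train_set = pd.DataFrame({
    "Text": [
        "an example of text",
        "ANOTHER example of text",
        "What's happening?Let's talk at 5",
        "just some plain words",
    ],
    "is_it_capital?": [0, 1, 1, 0],
    "is_it_upper?": [0, 1, 0, 0],
    "contains_num?": [0, 0, 1, 0],
    "Label": [0, 1, 1, 0],
})

X = train_set.drop("Label", axis=1)
y = train_set["Label"]

# TfidfVectorizer stands in for extract_func.tf_idf_n (assumption).
ct = ColumnTransformer(
    transformers=[("tfidf", TfidfVectorizer(), "Text")],
    remainder="passthrough",  # keep the already-encoded 0/1 columns
)
pipe = Pipeline([("preproc", ct), ("nb_clf", MultinomialNB())])
pipe.fit(X, y)

# The preprocessed matrix now has one row per sample, so NB gets consistent inputs.
Xt = pipe.named_steps["preproc"].transform(X)
print(Xt.shape[0], len(y))

pred = pipe.predict(X)
print(pred)
```

The passthrough columns are appended after the TF-IDF features, and since they are non-negative they are valid input for MultinomialNB. At prediction time, pass the test frame with 'Label' dropped in the same way.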