I'm doing some NLP classification, and I want to build a stacking ensemble.
My original data contains descriptions of each class at different levels of detail. For example, a single instance could have a column with its name, one with a short description, one with a description of its sub-category, and so on.
In the X_train in my code below, each column contains all the words for one level of granularity. E.g. the first column could be the short description, the second column the words of the sub-category description from another source, and the third column many more words from more granular categories.
I'm including the workflow of pipe_1 and pipe_2 wrapped in the StackingClassifier, as that is what I'm trying to do, but I get the same error if I just run pipe_1 as a standalone (fitting on pipe_1 directly).
I tried changing the X_train and y_train format (with ravel() and .tolist()), but I'm thinking the format problem may be appearing when the pipeline uses ColumnSelector, and I'm not sure how to approach that.
The types of X_train (<class 'pandas.core.frame.DataFrame'>) and y_train (<class 'pandas.core.series.Series'>) are the same as in a successful non-stacking run. In the successful run, what is passed to the fit method is a <class 'scipy.sparse.csr.csr_matrix'>. I would guess the same is true in the stacking example, as I expect TfidfVectorizer to deliver that. The major difference I see (and I suspect it might be what creates the problematic numpy.ndarray per row, since there is more than one column?) is that in the stacking case X_train has more than one column. But I would have expected the ColumnSelector in make_pipeline to take care of that.
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from mlxtend.feature_selection import ColumnSelector
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from mlxtend.classifier import StackingClassifier
# creating my toy trainset and testset
start = [
['apple this is painful Two wrongs make a right ok',
'just a batch of suspicious words and banana',
'another batch of fake words and another apple'],
['Fortune favors the italic sunny sunshine',
'name of a company and then its description',
'is it all sunshine or doomed to fail to no sunshine'],
['this was it when in rome do as the romans and make fortune',
'well again the same thing and those descriptions',
'lets make that work and bring the fortune'],
['Ok this is the last one and then its the end',
'is it the beggining of the end or the end of the beggining',
'allelouia']
]
X_train = pd.DataFrame(
start, columns=['High_level', 'Mid_level', 'Low_level'])
y_train = ['A', 'B', 'C', 'D']
X_test = pd.DataFrame([['mostly apple'], ['bunch of apple'],
['lot of fortune'], ['make fortune and bring the'],
['beggining of the end']])
y_true = ['A', 'A', 'C', 'C', 'D']
The error appears when I run the following:
pipe_1 = make_pipeline(ColumnSelector(cols=(1,)), TfidfVectorizer(min_df=1),
LogisticRegression(multi_class='multinomial'))
pipe_2 = make_pipeline(ColumnSelector(cols=(2,)), TfidfVectorizer(min_df=1),
LogisticRegression(multi_class='multinomial'))
sclf = StackingClassifier(
classifiers=[pipe_1, pipe_2],
meta_classifier=LogisticRegression(
solver='lbfgs', multi_class='multinomial',
C=1.0, class_weight='balanced', tol=1e-6, max_iter=1000,
n_jobs=-1))
predictions = sclf.fit(X_train, y_train).predict(X_test)
Here's the complete error:
Traceback (most recent call last):
File "C:/Users/inf10926/PycharmProjects/profiling/venv/lab.py", line 52, in <module>
predictions = sclf.fit(X_train, y_train).predict(X_test)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\mlxtend\classifier\stacking_classification.py", line 161, in fit
clf.fit(X, y)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 352, in fit
Xt, fit_params = self._fit(X, y, **fit_params)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 317, in _fit
**fit_params_steps[name])
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\joblib\memory.py", line 355, in __call__
return self.func(*args, **kwargs)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 716, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1652, in fit_transform
X = super().fit_transform(raw_documents)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1058, in fit_transform
self.fixed_vocabulary_)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 970, in _count_vocab
for feature in analyze(doc):
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 352, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 256, in <lambda>
return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
Process finished with exit code 1
And if I set lowercase=False in TfidfVectorizer, I get a different kind of error:
Traceback (most recent call last):
File "C:/Users/inf10926/PycharmProjects/profiling/venv/lab.py", line 52, in <module>
predictions = sclf.fit(X_train, y_train).predict(X_test)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\mlxtend\classifier\stacking_classification.py", line 161, in fit
clf.fit(X, y)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 352, in fit
Xt, fit_params = self._fit(X, y, **fit_params)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 317, in _fit
**fit_params_steps[name])
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\joblib\memory.py", line 355, in __call__
return self.func(*args, **kwargs)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 716, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1652, in fit_transform
X = super().fit_transform(raw_documents)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1058, in fit_transform
self.fixed_vocabulary_)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 970, in _count_vocab
for feature in analyze(doc):
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 352, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 265, in <lambda>
return lambda doc: token_pattern.findall(doc)
TypeError: cannot use a string pattern on a bytes-like object
I was facing the same issue. I solved it by adding drop_axis=True to the ColumnSelector. This parameter needs to be added when only one column is selected.
Please refer to the API here: http://rasbt.github.io/mlxtend/user_guide/feature_selection/ColumnSelector/#api
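For example, applied to the pipelines from the question (a sketch, not tested on your full data), it would look like:

pipe_1 = make_pipeline(ColumnSelector(cols=(1,), drop_axis=True),
                       TfidfVectorizer(min_df=1),
                       LogisticRegression(multi_class='multinomial'))
pipe_2 = make_pipeline(ColumnSelector(cols=(2,), drop_axis=True),
                       TfidfVectorizer(min_df=1),
                       LogisticRegression(multi_class='multinomial'))

With drop_axis=True the selector returns a 1-D array of strings instead of a 2-D array with one column, which is what TfidfVectorizer expects as its raw documents.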