Tags: python-3.x, scikit-learn, pipeline, attributeerror, tfidfvectorizer

AttributeError: 'numpy.ndarray' object has no attribute 'lower' in pipeline


I'm doing some NLP classification, and I want to build a stacking ensemble.

My original data contains different levels of description for each class. For example, for one instance, we could originally have a column with its name, one with a short description, one with a description of its sub-category, and so on.

The X_train in my code below is built so that each column contains all the words for one level of granularity. E.g. the first column could be the short description, the second column the words of the sub-category description plus words from another source, and the third column many more words from more granular categories.

I'm including the workflow of pipe_1 and pipe_2 wrapped in the StackingClassifier, as that is what I'm trying to do, but I get the same error if I run pipe_1 as a standalone (fitting on pipe_1 directly).

I tried changing the format of X_train and y_train (with ravel() and .tolist()), but I suspect the format problem appears when the pipeline uses ColumnSelector, and I'm not sure how to approach that.

The types of X_train (<class 'pandas.core.frame.DataFrame'>) and y_train (<class 'pandas.core.series.Series'>) are the same as in a successful non-stacking run. In the successful run, what is passed to the fit method is a <class 'scipy.sparse.csr.csr_matrix'>. I assume the same would be true in the stacking example, as I expect TfidfVectorizer to deliver that. The major difference I see is that for the stacking run X_train has more than one column, and I suspect this might produce the problematic numpy.ndarray for each row. But I would have expected the ColumnSelector in make_pipeline to "take charge of that problem".

import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from mlxtend.feature_selection import ColumnSelector
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from mlxtend.classifier import StackingClassifier

# creating my toy trainset and testset
start = [
    ['apple this is painful Two wrongs make a right ok',
     'just a batch of suspicious words and banana',
     'another batch of fake words and another apple'],
    ['Fortune favors the italic sunny sunshine',
     'name of a company and then its description',
     'is it all sunshine or doomed to fail to no sunshine'],
    ['this was it when in rome do as the romans and make fortune',
     'well again the same thing and those descriptions',
     'lets make that work and bring the fortune'],
    ['Ok this is the last one and then its the end',
     'is it the beggining of the end or the end of the beggining',
     'allelouia']
]

X_train = pd.DataFrame(
    start, columns=['High_level', 'Mid_level', 'Low_level'])
y_train = ['A', 'B', 'C', 'D']
X_test = pd.DataFrame([['mostly apple'], ['bunch of apple'],
                       ['lot of fortune'], ['make fortune and bring the'],
                       ['beggining of the end']])
y_true = ['A', 'A', 'C', 'C', 'D']

The error is raised by the fit call on the last line of the following block:

pipe_1 = make_pipeline(ColumnSelector(cols=(1,)), TfidfVectorizer(min_df=1),
                     LogisticRegression(multi_class='multinomial'))
pipe_2 = make_pipeline(ColumnSelector(cols=(2,)), TfidfVectorizer(min_df=1),
                     LogisticRegression(multi_class='multinomial'))
sclf = StackingClassifier(
        classifiers=[pipe_1, pipe_2],
        meta_classifier=LogisticRegression(
            solver='lbfgs', multi_class='multinomial',
            C=1.0, class_weight='balanced', tol=1e-6, max_iter=1000,
            n_jobs=-1))
predictions = sclf.fit(X_train, y_train).predict(X_test)

Here's the complete error:

Traceback (most recent call last):
  File "C:/Users/inf10926/PycharmProjects/profiling/venv/lab.py", line 52, in <module>
    predictions = sclf.fit(X_train, y_train).predict(X_test)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\mlxtend\classifier\stacking_classification.py", line 161, in fit
    clf.fit(X, y)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 352, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 317, in _fit
    **fit_params_steps[name])
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\joblib\memory.py", line 355, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 716, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1652, in fit_transform
    X = super().fit_transform(raw_documents)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1058, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 970, in _count_vocab
    for feature in analyze(doc):
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 352, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 256, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'

Process finished with exit code 1

If I set lowercase=False in TfidfVectorizer, I get a different error:

Traceback (most recent call last):
  File "C:/Users/inf10926/PycharmProjects/profiling/venv/lab.py", line 52, in <module>
    predictions = sclf.fit(X_train, y_train).predict(X_test)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\mlxtend\classifier\stacking_classification.py", line 161, in fit
    clf.fit(X, y)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 352, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 317, in _fit
    **fit_params_steps[name])
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\joblib\memory.py", line 355, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 716, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1652, in fit_transform
    X = super().fit_transform(raw_documents)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1058, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 970, in _count_vocab
    for feature in analyze(doc):
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 352, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 265, in <lambda>
    return lambda doc: token_pattern.findall(doc)
TypeError: cannot use a string pattern on a bytes-like object
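
Both tracebacks end inside TfidfVectorizer's per-document functions (x.lower() and token_pattern.findall(doc)), which suggests that each "document" reaching the vectorizer is not a str. The following is a minimal sketch of that suspected cause, using only numpy and re. The names col_2d and col_1d are illustrative and not part of the original code, and the object-dtype array stands in for what a column selector would return from a DataFrame of strings.

```python
import re
import numpy as np

# One text column kept as a 2D array of shape (n_samples, 1).
# Assumption: this mimics a selector that keeps the column axis.
col_2d = np.array(
    [["just a batch of suspicious words and banana"],
     ["name of a company and then its description"]],
    dtype=object,
)

# The same column flattened to shape (n_samples,).
col_1d = col_2d.ravel()

# A vectorizer iterates over its input and treats each item as a document.
doc_2d = next(iter(col_2d))  # a one-element numpy.ndarray
doc_1d = next(iter(col_1d))  # a plain Python str

print(type(doc_2d).__name__, hasattr(doc_2d, "lower"))  # ndarray False
print(type(doc_1d).__name__, hasattr(doc_1d, "lower"))  # str True

# With lowercasing disabled, the next step is a regex tokenizer,
# which also rejects an array and raises a TypeError.
try:
    re.findall(r"\w+", doc_2d)
    regex_failed = False
except TypeError:
    regex_failed = True
print(regex_failed)
```

If this is what happens in the pipeline, the input to TfidfVectorizer has to be a 1D sequence of strings rather than a single-column 2D array.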

Solution

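  • I was facing the same issue. I solved it by adding drop_axis=True to the ColumnSelector. This parameter is needed when only one column is selected: without it, ColumnSelector returns a 2D array of shape (n_samples, 1), so TfidfVectorizer receives each document as a one-element numpy.ndarray instead of a string. With drop_axis=True, it returns a 1D array of shape (n_samples,).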

    Please refer to the API here: http://rasbt.github.io/mlxtend/user_guide/feature_selection/ColumnSelector/#api