python scikit-learn sklearn-pandas

KeyError on FeatureUnion between TF-IDF and custom features


I am trying to build a model that uses TfidfVectorizer on a text column together with a couple of other columns that hold extra data about the text. The code below reproduces what I'm trying to do and the error I get.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB

class ParStats(BaseEstimator, TransformerMixin):

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        print(X[0])
        return [{'feat_1': x['feat_1'],
                 'feat_2': x['feat_2']}
                for x in X]

class ItemSelector(BaseEstimator, TransformerMixin):

    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

def feature_union_test():

    # create test data frame
    test_data = {
        'text': ['And the silken, sad, uncertain rustling of each purple curtain',
                 'Thrilled me filled me with fantastic terrors never felt before',
                 'So that now, to still the beating of my heart, I stood repeating',
                 'Tis some visitor entreating entrance at my chamber door',
                 'Some late visitor entreating entrance at my chamber door',
                 'This it is and nothing more'],
        'feat_1': [4, 7, 10, 7, 4, 6],
        'feat_2': [1, 5, 5, 1, 1, 10],
        'ignore': [1, 1, 1, 0, 0, 0]
    }
    test_df = pd.DataFrame(data=test_data)
    y_train = test_df['ignore'].values.astype('int')

    # Feature Union Pipeline
    pipeline = FeatureUnion([

                ('text', Pipeline([
                    ('selector', ItemSelector(key='text')),
                    ('tfidf', TfidfVectorizer(max_df=0.5)),
                ])),

                ('parstats', Pipeline([
                    ('stats', ParStats()),
                    ('vect', DictVectorizer()),
                ]))

            ])

    tfidf = pipeline.fit_transform(test_df)

    # fits Naive Bayes
    clf = BernoulliNB().fit(tfidf, y_train)

feature_union_test()

When I run this, I get the following error messages:

Traceback (most recent call last):
  File "C:\Users\Rogerio\Python VENV\lib\site-packages\pandas\core\indexes\base.py", line 3064, in get_loc
    return self._engine.get_loc(key)
  File "pandas\_libs\index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0

I've tried several different iterations of the pipeline and I always get some sort of error, so obviously I'm missing something. What am I doing wrong?


Solution

  • Following the discussion in the comments, this is the problem statement:

    You want to pass the columns feat_1 and feat_2, along with the TF-IDF features of the text column, to your ML model.
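
    For context on the traceback in the question, here is a minimal sketch of why ParStats.transform fails when it receives a DataFrame. It assumes a small two-row frame made up for illustration. X[0] is a column lookup, so it raises KeyError: 0, and iterating over the frame yields column labels rather than rows. Converting the frame with to_dict('records') is one way to get the per-row dicts that DictVectorizer expects.

```python
import pandas as pd

# Hypothetical two-row frame, used only to illustrate the behaviour.
df = pd.DataFrame({'feat_1': [4, 7], 'feat_2': [1, 5]})

# Indexing a DataFrame with [] looks up a column label,
# so 0 is treated as a column name and is not found.
try:
    df[0]
except KeyError as err:
    key_error = err
    print('KeyError:', err)

# Iterating over a DataFrame yields its column labels, not its rows.
column_labels = list(df)
print(column_labels)

# One dict per row, which is the input shape DictVectorizer works with.
records = df[['feat_1', 'feat_2']].to_dict('records')
print(records)
```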

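    So the only change you need is in the FeatureUnion: replace the parstats branch (ParStats plus DictVectorizer) with an ItemSelector over both columns: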

    # Feature Union Pipeline
    pipeline = FeatureUnion([('text', Pipeline([('selector', ItemSelector(key='text')),
                                                ('tfidf', TfidfVectorizer(max_df=0.5)),
                                               ])),
                             ('non_text', ItemSelector(key=['feat_1', 'feat_2']))
                            ])
    
    tfidf = pipeline.fit_transform(test_df)
    

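    The ItemSelector from the question also accepts a list of keys, so it can select several columns at once. FeatureUnion stacks the outputs of its branches side by side, so feat_1 and feat_2 end up as the last two columns, after the TF-IDF columns returned by the text branch.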
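
    Putting the pieces together, the following is a sketch of the whole script with this change applied, using the data and classes from the question. The names features and n_text_columns are introduced here for illustration; everything else comes from the code above.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import FeatureUnion, Pipeline


class ItemSelector(BaseEstimator, TransformerMixin):
    """Select one column (string key) or several columns (list of keys)."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]


# Test data from the question.
test_df = pd.DataFrame({
    'text': ['And the silken, sad, uncertain rustling of each purple curtain',
             'Thrilled me filled me with fantastic terrors never felt before',
             'So that now, to still the beating of my heart, I stood repeating',
             'Tis some visitor entreating entrance at my chamber door',
             'Some late visitor entreating entrance at my chamber door',
             'This it is and nothing more'],
    'feat_1': [4, 7, 10, 7, 4, 6],
    'feat_2': [1, 5, 5, 1, 1, 10],
    'ignore': [1, 1, 1, 0, 0, 0],
})
y_train = test_df['ignore'].to_numpy()

pipeline = FeatureUnion([
    # Text branch: pick the text column, then vectorize it.
    ('text', Pipeline([
        ('selector', ItemSelector(key='text')),
        ('tfidf', TfidfVectorizer(max_df=0.5)),
    ])),
    # Numeric branch: pass feat_1 and feat_2 through unchanged.
    ('non_text', ItemSelector(key=['feat_1', 'feat_2'])),
])

features = pipeline.fit_transform(test_df)

# Number of TF-IDF columns, read from the fitted vectorizer.
fitted_text_branch = pipeline.transformer_list[0][1]
n_text_columns = len(fitted_text_branch.named_steps['tfidf'].vocabulary_)
print(features.shape, n_text_columns)

# Fit Naive Bayes on the combined features.
clf = BernoulliNB().fit(features, y_train)
print(clf.predict(features))
```

    If a custom selector class is not required, sklearn.compose.ColumnTransformer (available since scikit-learn 0.20) can route columns to transformers in a similar way.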