Gridsearch for NLP - How to combine CountVec and other features?


I am doing a basic NLP project on sentiment analysis, and I would like to use GridSearchCV to optimise my model.

The code below shows a sample dataframe I am working with. 'content' is the column to pass to CountVectorizer, 'label' is the y column to be predicted, and 'feature_1' and 'feature_2' are columns I also wish to include in my model.

import pandas as pd

df = pd.DataFrame([
    {'content': 'Got flat way today Pot hole Another thing tick crap thing happen week list',
     'feature_1': '1',
     'feature_2': '34',
     'label': 1},
    {'content': 'UP today Why doe head hurt badly',
     'feature_1': '5',
     'feature_2': '142',
     'label': 1},
    {'content': 'spray tan fail leg foot Ive scrubbing foot look better ',
     'feature_1': '7',
     'feature_2': '123',
     'label': 0},
])

I am following a Stack Overflow answer, "Perform feature selection using pipeline and gridsearch":

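import numpy as np
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
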
class CustomFeatureExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, feature_1=True, feature_2=True):
        self.feature_1=feature_1
        self.feature_2=feature_2
        
    def extractor(self, tweet):
        features = []

        if self.feature_2:
            
            features.append(df['feature_2'])

        if self.feature_1:
            features.append(df['feature_1'])
        
          
        return np.array(features)

    def fit(self, raw_docs, y):
        return self

    def transform(self, raw_docs):
        
        return np.vstack(tuple([self.extractor(tweet) for tweet in raw_docs]))

Below is the pipeline I tried to build for the grid search on my dataframe:

lr = LogisticRegression()

# Pipeline
pipe = Pipeline([('features', FeatureUnion([("vectorizer", CountVectorizer(df['content'])),
                                            ("extractor", CustomFeatureExtractor())]))
                 ,('classifier', lr())
                ])

But this raises: TypeError: 'LogisticRegression' object is not callable

Are there any easier ways to do this?

I have already referred to the threads below, but to no avail: "How to combine TFIDF features with other features" and "Perform feature selection using pipeline and gridsearch".


Solution

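  • You cannot call lr(). lr is already a LogisticRegression instance, and instances of that class are not callable; they only expose methods such as fit and predict. The pipeline expects the estimator object itself.

    Try this instead (lr without parentheses):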

    lr = LogisticRegression()
    pipe = Pipeline([('features', FeatureUnion([("vectorizer", CountVectorizer(df['content'])),
                                                ("extractor", CustomFeatureExtractor())]))
                     ,('classifier', lr)
                    ])
    

    and the TypeError should disappear.
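
On the "easier way" part of the question: one option, sketched here as a suggestion and not taken from the answer above, is scikit-learn's ColumnTransformer, which routes the text column to CountVectorizer and passes the numeric columns through unchanged, so no custom extractor class is needed. Note that CountVectorizer's constructor takes configuration options and not the text itself; the data reaches it through fit. The sketch makes a few assumptions that are not in the question: the feature columns are cast from strings to integers, the three sample rows are repeated once only so that 2-fold cross-validation has enough rows per class, and the step names and parameter grid values are arbitrary illustrations.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Sample rows from the question.
df = pd.DataFrame([
    {'content': 'Got flat way today Pot hole Another thing tick crap thing happen week list',
     'feature_1': '1', 'feature_2': '34', 'label': 1},
    {'content': 'UP today Why doe head hurt badly',
     'feature_1': '5', 'feature_2': '142', 'label': 1},
    {'content': 'spray tan fail leg foot Ive scrubbing foot look better ',
     'feature_1': '7', 'feature_2': '123', 'label': 0},
])

# Assumption: repeat the rows so each class appears at least twice and cv=2 can split them.
data = pd.concat([df, df], ignore_index=True)
# The feature columns are stored as strings, so cast them to integers.
data = data.astype({'feature_1': int, 'feature_2': int})

X = data[['content', 'feature_1', 'feature_2']]
y = data['label']

# Bag-of-words for the text column; the numeric columns are passed through as-is.
features = ColumnTransformer([
    ('vectorizer', CountVectorizer(), 'content'),
    ('numeric', 'passthrough', ['feature_1', 'feature_2']),
])

pipe = Pipeline([
    ('features', features),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# Nested parameters are addressed as <step>__<transformer>__<parameter>.
param_grid = {
    'features__vectorizer__ngram_range': [(1, 1), (1, 2)],
    'classifier__C': [0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(X, y)
print(search.best_params_)
```

With a real dataset you would drop the row duplication and pass the full dataframe columns as X; the grid can then be extended with any other CountVectorizer or LogisticRegression parameters using the same double-underscore naming.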