I am doing a basic NLP project on sentiment analysis, and I would like to use GridSearchCV to optimise my model.
The code below shows a sample dataframe I am working with. 'content' is the column to pass to CountVectorizer, 'label' is the y column to be predicted, and feature_1 and feature_2 are columns I wish to include in my model as well.
    import pandas as pd

    df = pd.DataFrame([
        {'content': 'Got flat way today Pot hole Another thing tick crap thing happen week list',
         'feature_1': '1',
         'feature_2': '34',
         'label': 1},
        {'content': 'UP today Why doe head hurt badly',
         'feature_1': '5',
         'feature_2': '142',
         'label': 1},
        {'content': 'spray tan fail leg foot Ive scrubbing foot look better ',
         'feature_1': '7',
         'feature_2': '123',
         'label': 0},
    ])
I am following a Stack Overflow answer, "Perform feature selection using pipeline and gridsearch":
    import numpy as np
    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn.base import TransformerMixin, BaseEstimator

    class CustomFeatureExtractor(BaseEstimator, TransformerMixin):
        def __init__(self, feature_1=True, feature_2=True):
            self.feature_1 = feature_1
            self.feature_2 = feature_2

        def extractor(self, tweet):
            features = []
            if self.feature_2:
                features.append(df['feature_2'])
            if self.feature_1:
                features.append(df['feature_1'])
            return np.array(features)

        def fit(self, raw_docs, y):
            return self

        def transform(self, raw_docs):
            return np.vstack(tuple([self.extractor(tweet) for tweet in raw_docs]))
Below is the pipeline I tried to build for the grid search:
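    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    lr = LogisticRegression()
    # Pipeline
    pipe = Pipeline([('features', FeatureUnion([("vectorizer", CountVectorizer(df['content'])),
                                                ("extractor", CustomFeatureExtractor())])),
                     ('classifier', lr())
                     ])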
But this raises: TypeError: 'LogisticRegression' object is not callable
Is there an easier way to do this?
I have already gone through the following threads, to no avail: "How to combine TFIDF features with other features" and "Perform feature selection using pipeline and gridsearch".
You cannot call lr(): lr already holds a LogisticRegression instance, and an instance is indeed not callable. The object only exposes methods, such as fit and predict.
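To see why the call fails, here is a minimal sketch in plain Python. The Model class is a hypothetical stand-in for an estimator class such as LogisticRegression, not scikit-learn code: calling the class builds an instance, but the instance defines no __call__, so calling it raises the same kind of TypeError.

```python
class Model:
    """Hypothetical stand-in for an estimator class such as LogisticRegression."""

    def fit(self):
        return self


model = Model()          # calling the class creates an instance

print(callable(Model))   # True: the class itself can be called
print(callable(model))   # False: the instance defines no __call__

try:
    model()              # same mistake as writing lr() in the pipeline
except TypeError as exc:
    message = str(exc)
    print(message)       # 'Model' object is not callable
```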
Try instead (lr without brackets):
    lr = LogisticRegression()
    pipe = Pipeline([('features', FeatureUnion([("vectorizer", CountVectorizer(df['content'])),
                                                ("extractor", CustomFeatureExtractor())])),
                     ('classifier', lr)
                     ])
and the TypeError should disappear.
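As for an easier way: one option, sketched here under a few assumptions, is ColumnTransformer from sklearn.compose, which sends each dataframe column to its own transformer. That removes the need for a custom extractor. FeatureUnion, by contrast, hands the same input to every transformer, and the first constructor argument of CountVectorizer is the input mode rather than the data. The assumptions are mine, not from the question: the three sample rows are repeated so that each cross-validation fold contains both classes, the feature columns are cast to float, and the parameter grid values, cv=2 and max_iter=1000 are arbitrary illustrative choices.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rows = [
    {'content': 'Got flat way today Pot hole Another thing tick crap thing happen week list',
     'feature_1': '1', 'feature_2': '34', 'label': 1},
    {'content': 'UP today Why doe head hurt badly',
     'feature_1': '5', 'feature_2': '142', 'label': 1},
    {'content': 'spray tan fail leg foot Ive scrubbing foot look better ',
     'feature_1': '7', 'feature_2': '123', 'label': 0},
]

# Assumption: repeat the sample rows so every CV fold sees both classes.
df = pd.DataFrame(rows * 4)
# The extra features are stored as strings, so cast them to numbers.
df = df.astype({'feature_1': float, 'feature_2': float})

X = df[['content', 'feature_1', 'feature_2']]
y = df['label']

features = ColumnTransformer([
    # A plain string selects one 1-D column of text, which CountVectorizer expects.
    ('vectorizer', CountVectorizer(), 'content'),
    # A list selects 2-D numeric columns that are passed through unchanged.
    ('extra', 'passthrough', ['feature_1', 'feature_2']),
])

pipe = Pipeline([
    ('features', features),
    ('classifier', LogisticRegression(max_iter=1000)),  # an instance, not lr()
])

# Nested parameters are addressed as <step>__<transformer>__<parameter>.
param_grid = {
    'features__vectorizer__ngram_range': [(1, 1), (1, 2)],
    'classifier__C': [0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(X, y)
print(search.best_params_)
```

The whole dataframe goes into fit, and the column selection happens inside the pipeline, so the same preprocessing is refitted on each training fold during the search.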