python  python-3.x  scikit-learn  logistic-regression

Bad input shape on sentiment analysis logistic regression


I'd like to measure the accuracy of a sentiment analysis model built with logistic regression, but I get a "bad input shape" error. (Edited to include the inputs.)

Data Frame:

df
sentence                | polarity_label
new release!            | positive
buy                     | neutral
least good-looking      | negative

Code:

from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
# Define the set of stop words
my_stop_words = ENGLISH_STOP_WORDS
vect = CountVectorizer(max_features=5000,stop_words=my_stop_words)
vect.fit(df.sentence)
X = vect.transform(df.sentence)
y = df.polarity_label
encoder = OneHotEncoder()
encoder.fit_transform(y)

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=123)
LogisticRegression(penalty='l2',C=1.0)

log_reg = LogisticRegression().fit(X_train, y_train)

Error message:

ValueError: Expected 2D array, got 1D array instead:
array=['Neutral' 'Positive' 'Positive' ... 'Neutral' 'Neutral' 'Neutral'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

How can I fix this?

Solution

  • Adjust your code like this, for example:

    y = df.polarity_label
    

    Currently you are trying to transform your y into a vector with your CountVectorizer, which was fitted on the sentence data.

    So the CountVectorizer has this vocabulary (you can get it with vect.get_feature_names_out(), or with vect.get_feature_names() in scikit-learn versions before 1.0):

    ['buy', 'good', 'looking', 'new', 'release']

    and will transform any text that contains these words into a vector of word counts.

    But when you use it on your y, which contains only the words positive, neutral and negative, it does not find any of its "known" words, so every row of the result is all zeros and your y is effectively empty.

    If you inspect your y after that transformation, you can also see that it is empty:

    <3x5 sparse matrix of type '<class 'numpy.int64'>'
        with 0 stored elements in Compressed Sparse Row format>
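As a rough end-to-end sketch of the points above: the snippet below first shows that a CountVectorizer fitted on the sentences finds none of its words in the labels, and then fits LogisticRegression on the raw string labels, which scikit-learn classifiers accept directly as a 1D target (so no OneHotEncoder is needed for y). The toy lists are an assumption standing in for df.sentence and df.polarity_label; the three rows from the question are repeated only so that a train/test split is possible. stop_words="english" is used here as a stand-in for passing ENGLISH_STOP_WORDS, and accuracy_score is one possible way to measure accuracy, not something taken from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data (assumption): stand-ins for df.sentence and df.polarity_label,
# repeated so that every class appears in both the train and test split.
sentences = ["new release!", "buy", "least good-looking"] * 10
labels = ["positive", "neutral", "negative"] * 10

# Fit the vectorizer on the sentences only.
vect = CountVectorizer(max_features=5000, stop_words="english")
X = vect.fit_transform(sentences)

# Transforming the labels with the sentence vocabulary matches no known
# word, so the resulting sparse matrix has no stored (non-zero) elements.
labels_as_counts = vect.transform(labels)
print("non-zero entries when transforming labels:", labels_as_counts.nnz)

# Use the labels as they are: a 1D sequence of class names.
y = labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y
)

# L2 regularisation is the default penalty; C=1.0 mirrors the question.
log_reg = LogisticRegression(C=1.0).fit(X_train, y_train)

accuracy = accuracy_score(y_test, log_reg.predict(X_test))
print("test accuracy:", accuracy)
```

If integer class ids are wanted instead of strings, LabelEncoder from sklearn.preprocessing is the encoder meant for a 1D target; OneHotEncoder is designed for 2D feature matrices, which is why it complains about the 1D input.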