python pandas dataframe machine-learning logistic-regression

LogisticRegression model producing 100 percent accuracy

I have fetched Amazon reviews for a product and am now trying to train a logistic regression model on them to categorize customer reviews. The model reports 100 percent accuracy, and I cannot work out why. Here is a sample from my dataset:

Name	Stars	Title	Date	Description
Dipam	5	5.0 out of 5 stars	N/A	A very good fragrance. Recommended Seller - Sun Fragrances
sanket shah	5	5.0 out of 5 stars	N/A	Yes
Manoranjidham	5	5.0 out of 5 stars	N/A	This perfume is ranked No 3 .. Good one :)
Moukthika	5	5.0 out of 5 stars	N/A	I was gifted Versace Bright on my 25th Birthday. Fragrance stays for at least for 24 hours. I love it. This is one of my best collections.
megh	5	5.0 out of 5 stars	N/A	I have this perfume but didn't get it online..the smell is just amazing.it stays atleast for 2 days even if you take bath or wash d cloth. I have got so many compliments..
riya	5	5.0 out of 5 stars	N/A	Bought it from somewhere else,awesome fragrance, pure rose kind of smell stays for long,my guy loves this purchase of mine n fragrance too.
manisha.chauhan0091	5	5.0 out of 5 stars	N/A	Its light n long lasting i like it
UPS	1	1.0 out of 5 stars	N/A	Absolutely fake. Fragrance barely lasts for 15 minutes. Extremely harsh on the skin as well.
sanaa	1	1.0 out of 5 stars	N/A	a con game. fake product. dont fall for it
Juliana Soares Ferreira	N/A	Ótimo produto	N/A	Produto verdadeiro, com cheio da riqueza, não fixa muito, mas é delicioso. Dura na minha pele umas 3 horas e depois fica um cheirinho leve...Super recomendo

Here is my code:

import re
import nltk
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Ensure necessary NLTK datasets and models are downloaded
# nltk.download('punkt')
# nltk.download('vader_lexicon')

# Load the data
df = pd.read_csv("reviews.csv")  # Make sure to replace 'reviews.csv' with your actual file path

# Preprocess data
df['Stars'] = df['Stars'].fillna(3.0)  # Handle missing values
df['Title'] = df['Title'].str.lower()  # Standardize text formats
df['Description'] = df['Description'].str.lower()
df = df.drop(['Name', 'Date'], axis=1)  # Drop unnecessary columns
print(df)


# Categorize sentiment based on star ratings
def categorize_sentiment(stars):
    if stars >= 4.0:
        return 'Positive'
    elif stars <= 2.0:
        return 'Negative'
    else:
        return 'Neutral'


df['Sentiment'] = df['Stars'].apply(categorize_sentiment)


# Clean and tokenize text
def clean_text(text):
    text = BeautifulSoup(text, "html.parser").get_text()
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    return letters_only.lower()


def tokenize(text):
    return word_tokenize(text)


df['Clean_Description'] = df['Description'].apply(clean_text)
df['Tokens'] = df['Clean_Description'].apply(tokenize)

# Apply NLTK's VADER for sentiment analysis
sia = SentimentIntensityAnalyzer()


def get_sentiment(text):
    score = sia.polarity_scores(text)
    if score['compound'] >= 0.05:
        return 'Positive'
    elif score['compound'] <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'


df['NLTK_Sentiment'] = df['Clean_Description'].apply(get_sentiment)
print("df['NLTK_Sentiment'].value_counts()")
print(df['NLTK_Sentiment'].value_counts())

# Prepare data for machine learning
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(tokenizer=tokenize)
X = vectorizer.fit_transform(df['Clean_Description'])
y = df['NLTK_Sentiment'].apply(lambda x: 1 if x == 'Positive' else 0)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=80)

# Train a Logistic Regression model

# Compute class weights
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
class_weights_dict = dict(enumerate(class_weights))
print(f"class_weights_dict {class_weights_dict}")
# Apply to Logistic Regression
# model = LogisticRegression(class_weight=class_weights_dict)
model = LogisticRegression(C=0.001, penalty='l2', class_weight='balanced')

model.fit(X_train, y_train)

# Predict sentiments on the test set
predictions = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions, average='weighted')
recall = recall_score(y_test, predictions, average='weighted')
f1 = f1_score(y_test, predictions, average='weighted')

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")

Here are the results of the print statements:

NLTK_Sentiment
Positive 8000
Negative 2000
Name: count, dtype: int64

class_weights_dict {0: 2.3696682464454977, 1: 0.6337135614702155}
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1 Score: 1.0000

I am unable to find the reason why my model is always giving 100 percent accuracy.

Solution

Your NLTK_Sentiment column is computed from the Clean_Description column, and your feature matrix X is also built from Clean_Description. The label is therefore a deterministic function of the same text the model sees.

You are essentially testing whether there is a linear relationship between the per-token weights (TF-IDF values here, which are rescaled token counts) and the VADER categorization. Since VADER works by assigning each word a score between -4 and 4 and summing them up, that relationship is close to linear. (There are some exceptions: VADER recognizes some idioms like 'bad ass' and negations like 'not good', but outside of those special cases it's linear.)
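To make the linearity argument concrete, here is a minimal sketch using only the standard library. It assumes a made-up six-word lexicon (these are not VADER's real scores) and a plain perceptron as a stand-in for logistic regression. The labels come from summing word scores over the same tokens that produce the count features, so a linear classifier can fit them exactly.

```python
import random

# Hypothetical toy lexicon (NOT VADER's real values): word -> integer valence.
LEXICON = {"good": 2, "great": 3, "love": 3, "bad": -2, "fake": -3, "harsh": -2}
VOCAB = sorted(LEXICON) + ["the", "smell", "bottle", "it"]  # last four are neutral


def lexicon_label(tokens):
    """1 if the summed word scores are positive, else 0 (a VADER-like rule)."""
    return 1 if sum(LEXICON.get(t, 0) for t in tokens) > 0 else 0


def count_vector(tokens):
    """Bag-of-words counts over VOCAB."""
    return [tokens.count(w) for w in VOCAB]


def predict(w, b, x):
    """Linear decision rule: 1 if w.x + b > 0, else 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0


def train_perceptron(X, y, max_epochs=5000):
    """Plain perceptron; stops once a full pass makes no mistakes."""
    w, b = [0.0] * len(VOCAB), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            step = yi - predict(w, b, xi)  # 0 if correct, else +1 or -1
            if step:
                mistakes += 1
                w = [wj + step * xj for wj, xj in zip(w, xi)]
                b += step
        if mistakes == 0:
            break
    return w, b


def accuracy(w, b, X, y):
    return sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(y)


rng = random.Random(0)
reviews = [[rng.choice(VOCAB) for _ in range(5)] for _ in range(400)]
X = [count_vector(r) for r in reviews]
y = [lexicon_label(r) for r in reviews]  # label is a function of the same text

X_train, y_train, X_test, y_test = X[:200], y[:200], X[200:], y[200:]
w, b = train_perceptron(X_train, y_train)

print("train accuracy:", accuracy(w, b, X_train, y_train))
print("test accuracy: ", accuracy(w, b, X_test, y_test))
```

Because the labels are separable by a linear rule on the counts, the perceptron is guaranteed to fit the training half without error, and the held-out accuracy it prints should land at or near the same level. That is the same effect as in the question, just on a smaller scale.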

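For that reason, logistic regression is essentially just recovering VADER's word-level weights. You're giving it an easy problem, and that's why you get such a high score. A label that does not come from the text itself, such as the Sentiment column you already derive from Stars, would give a more meaningful test.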
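One way to get a more informative score is to predict a label that was not computed from the review text. The following is a rough sketch, assuming the star rating is an acceptable ground truth and using a tiny made-up DataFrame in place of reviews.csv. The binary threshold (4 stars and up counts as positive) and the sample rows are illustrative assumptions, not part of the original code.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Tiny made-up sample in the shape of the question's data (illustration only).
df = pd.DataFrame({
    "Stars": [5, 5, 5, 4, 4, 5, 1, 1, 2, 1, 2, 1],
    "Description": [
        "a very good fragrance",
        "the smell is just amazing",
        "light and long lasting i like it",
        "awesome fragrance stays for long",
        "i love it one of my best collections",
        "good one recommended seller",
        "absolutely fake barely lasts",
        "a con game fake product",
        "extremely harsh on the skin",
        "dont fall for it",
        "smell fades in fifteen minutes",
        "not the original product",
    ],
})

# Target comes from the human star rating, not from VADER run on the same text.
y = (df["Stars"] >= 4).astype(int)

# Split the raw text first, then fit the vectorizer on the training part only,
# so no vocabulary or IDF statistics leak from the test reviews.
text_train, text_test, y_train, y_test = train_test_split(
    df["Description"], y, test_size=0.5, random_state=80, stratify=y
)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(text_train)
X_test = vectorizer.transform(text_test)

model = LogisticRegression(class_weight="balanced")
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print(f"Accuracy against star-based labels: {accuracy_score(y_test, predictions):.4f}")
```

On real data this number will usually be well below 1.0, because star ratings are not a simple function of the words in the review. If you still want to know how good VADER is, you can compare NLTK_Sentiment against the star-based label directly, without training anything.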