
Counter() and most_common


I am using a Counter() to count words in an Excel file. My goal is to find the most frequent words in the document. The problem is that Counter() does not work properly with my file. Here is the code:

#1. Building a Counter with bag-of-words

import pandas as pd
import nltk
from nltk.tokenize import word_tokenize

df = pd.read_excel('combined_file.xlsx', index_col=None)

# Tokenize the article: tokens
df['tokens'] = df['body'].apply(nltk.word_tokenize)

# Convert the tokens column into a list of lists
df_tokens_list = df.tokens.tolist()

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [[string.lower() for string in sublist] for sublist in df_tokens_list]

# Import Counter
from collections import Counter

# Create a Counter with the lowercase tokens: bow_simple
bow_simple = Counter(x for xs in lower_tokens for x in set(xs))

# Print the 10 most common tokens
print(bow_simple.most_common(10))

#2. Text preprocessing practice

# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

# Retain alphabetic words: alpha_only
alpha_only = [t for t in bow_simple if t.isalpha()]

# Remove all stop words: no_stops
from nltk.corpus import stopwords
no_stops = [t for t in alpha_only if t not in stopwords.words("english")]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
bow = Counter(lemmatized)
print(bow)
# Print the 10 most common tokens
print(bow.most_common(10))

The most frequent words after preprocessing are:

[('dry', 3), ('try', 3), ('clean', 3), ('love', 2), ('one', 2), ('serum', 2), ('eye', 2), ('boot', 2), ('woman', 2), ('cream', 2)]

These counts do not match what we get by tallying the words by hand in Excel. Do you have any idea what might be wrong with my code? I would appreciate any help in that regard.

The link to the file is here: https://www.dropbox.com/scl/fi/43nu0yf45obbyzprzc86n/combined_file.xlsx?dl=0&rlkey=7j959kz0urjxflf6r536brppt


Solution

  • The problem is that bow_simple is a Counter, and the later steps iterate over it. Iterating over a Counter yields each distinct key exactly once, so every token enters alpha_only at most once; the end result merely counts how many lowercased, NLTK-processed variations of each word exist, not how often the words occur. (Note also that set(xs) deduplicates tokens within each row, so bow_simple itself counts the number of rows a word appears in rather than total occurrences.) The solution is to create a flattened word list and feed that into alpha_only; a minimal demonstration follows the output below:

    # Create a Counter with the lowercase tokens: bow_simple
    wordlist = [item for sublist in lower_tokens for item in sublist]  # flatten the list of lists
    bow_simple = Counter(wordlist)
    

    Then use wordlist in alpha_only:

    alpha_only = [t for t in wordlist if t.isalpha()]
    

    Output:

    [('eye', 3617), ('product', 2567), ('cream', 2278), ('skin', 1791), ('good', 1081), ('use', 1006), ('really', 984), ('using', 928), ('feel', 798), ('work', 785)]
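
    To see the pitfall in isolation, here is a minimal sketch with made-up tokens (not the real data) showing that iterating over a Counter yields each distinct key exactly once, while flattening first preserves every occurrence:

    from collections import Counter

    # Hypothetical two-row token data, standing in for lower_tokens
    lower_tokens = [["eye", "eye", "cream"], ["eye", "cream", "serum"]]

    # Original approach: set(xs) deduplicates within each row, so this
    # counts the rows containing a word, not its total occurrences
    bow_simple = Counter(x for xs in lower_tokens for x in set(xs))
    print(bow_simple["eye"])               # 2 rows, although "eye" occurs 3 times

    # Iterating over the Counter yields each key exactly once
    print(sorted(t for t in bow_simple))   # ['cream', 'eye', 'serum']

    # Flattened approach: keep every occurrence, count once at the end
    wordlist = [item for sublist in lower_tokens for item in sublist]
    print(Counter(wordlist).most_common(2))  # [('eye', 3), ('cream', 2)]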