
collections.Counter() is counting letters instead of words


I have to find the most frequently occurring words in the df['messages'] column of a DataFrame. The DataFrame has many columns, so I formatted all the rows of that column and stored them as a single string (words joined by spaces) in one variable, all_words. But when I tried to count the most common words, it showed me the most common characters instead. My data looks like this:

0    abc de fghi klm
1    qwe sd fd s dsdd sswd??
3    ded fsf sfsdc wfecew wcw.

Here is snippet of my code.

    from collections import Counter
    all_words = ' '
    for msg in df['messages'].values:
        words = str(msg).lower()
        all_words = all_words + str(words) + ' '
            
    count = Counter(all_words)
    count.most_common(3)

And here is its output:

[(' ', 5260), ('a', 2919), ('h', 1557)]

I also tried df['messages'].value_counts(), but it returns the most frequent rows (whole sentences) instead of words, like:

asad adas asda     10
asaa as awe        3
wedxew dqwed       1

Please tell me where I am wrong or suggest any other method that might work.


Solution

  • Counter iterates over whatever you pass to it. If you pass it a string, it iterates over the string's characters (and that is what it counts). If you pass it a list where each element is a word, it counts words.

    from collections import Counter
    
    text = "spam and more spam"
    
    c = Counter()
    c.update(text)  # text is a str, count chars
    c
    # Counter({'s': 2, 'p': 2, 'a': 3, 'm': 3, [...], 'e': 1})
    
    c = Counter()
    c.update(text.split())  # now it is a list: ['spam', 'and', 'more', 'spam'], so words are counted
    c
    # Counter({'spam': 2, 'and': 1, 'more': 1})
    

    So you should do something like this:

    from collections import Counter
    
    all_words = []
    for msg in df['messages'].values:
        words = str(msg).lower().split()  # split each message into a list of words
        all_words.extend(words)  # extend adds the individual words, not the list itself
    
    count = Counter(all_words)
    count.most_common(3)
    
    # the same, but with a generator expression
    count = Counter(word for msg in df['messages'].values for word in str(msg).lower().split())
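
The sample data in the question contains punctuation ("sswd??", "wcw."), and a plain whitespace split would leave it attached to the words. The following is a minimal sketch of one way to handle that, not the only one. It assumes that treating any run of word characters as a word is acceptable. The `messages` list is a hypothetical stand-in for `df['messages'].values`, and `count_words` is an illustrative name.

```python
import re
from collections import Counter

# Hypothetical stand-in for df['messages'].values, loosely based on the
# question's sample rows, with one extra row added so that some words repeat.
messages = [
    "abc de fghi klm",
    "qwe sd fd s dsdd sswd??",
    "ded fsf sfsdc wfecew wcw.",
    "ABC wcw",
]


def count_words(msgs):
    """Count lowercased words across all messages, dropping punctuation."""
    counts = Counter()
    for msg in msgs:
        # \w+ matches runs of letters, digits and underscores,
        # so "sswd??" contributes "sswd" and "wcw." contributes "wcw".
        counts.update(re.findall(r"\w+", str(msg).lower()))
    return counts


word_counts = count_words(messages)
print(word_counts.most_common(3))
```

If you would rather stay inside pandas, chaining `.str.lower().str.split().explode().value_counts()` on the column should give similar whitespace-based counts, though it does not strip punctuation on its own.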