python refactoring counter dictionary-comprehension

Dictionary comprehension counter and refactoring Python code

I'm teaching myself Python, and I've started refactoring code to learn new and more efficient ways to write it.

I tried to build word_dict with a dictionary comprehension, but I couldn't find a way to do it. I ran into two problems:

I tried to put word_dict[word] += 1 inside the comprehension by writing word_dict[word] := word_dict[word] + 1.
I wanted to check whether the element was already in the dictionary being built, using if word not in word_dict, but that didn't work.

The dictionary comprehension I tried is:

word_dict = {word_dict[word] := 0 if word not in word_dict else word_dict[word] := word_dict[word] + 1 for word in text_split}

Here is the working code. It reads a text and counts the different words in it. If you know a better way to do it, please let me know.

import re

text = "hello Hello, water! WATER:HELLO. water , HELLO"

# clean the text: replace punctuation with spaces
text_cleaned = re.sub(r':|!|,|\.', " ", text)
# Output 'hello Hello  water  WATER HELLO  water   HELLO'

# create a list of words, skipping the empty strings left by repeated spaces
text_split = [element for element in text_cleaned.split(' ') if element != '']
# Output ['hello', 'Hello', 'water', 'WATER', 'HELLO', 'water', 'HELLO']

word_dict = {}

for word in text_split:
    if word not in word_dict:
        word_dict[word] = 0
    word_dict[word] += 1

word_dict
# Output {'hello': 1, 'Hello': 1, 'water': 2, 'WATER': 1, 'HELLO': 2}

Solution

Right now you're using a regex to remove some undesirable characters, and then you split on whitespace to get a list of words. Why not use a regex to get the words right away? You can also take advantage of collections.Counter to create a dictionary, where the keys are words, and the associated values are counts/occurrences:

import re
from collections import Counter

text = "hello Hello, water! WATER:HELLO. water , HELLO"

pattern = r"\b\w+\b"

print(Counter(re.findall(pattern, text)))

Output:

Counter({'water': 2, 'HELLO': 2, 'hello': 1, 'Hello': 1, 'WATER': 1})
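
As for the comprehension in the question, it cannot work as written, for two reasons. The name word_dict is not rebound to the new dictionary until the whole comprehension has finished, so the expression cannot look inside the dictionary it is building. The := operator also only accepts a plain name as its target, so word_dict[word] := ... is a syntax error. If a dictionary comprehension is still wanted, one possible sketch is shown below. The names words and word_dict_loop are only illustrative, and list.count rescans the list for every distinct word, so Counter is likely the better choice for large inputs.

```python
import re
from collections import Counter

text = "hello Hello, water! WATER:HELLO. water , HELLO"
words = re.findall(r"\b\w+\b", text)

# Dictionary comprehension: one key per distinct word, value taken from list.count.
# dict.fromkeys(words) is used only to drop duplicates while keeping first-seen order.
word_dict = {word: words.count(word) for word in dict.fromkeys(words)}

# Plain loop using dict.get, which avoids the explicit "if word not in" check.
word_dict_loop = {}
for word in words:
    word_dict_loop[word] = word_dict_loop.get(word, 0) + 1

print(word_dict)
# {'hello': 1, 'Hello': 1, 'water': 2, 'WATER': 1, 'HELLO': 2}
```

Both variants produce the same mapping as Counter; the comprehension is mostly of interest as an exercise, since it does more work than a single pass.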

Here's what the regex pattern is composed of:

\b - represents a word boundary (will not be included in the match)
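\w+ - one or more word characters. With the re.ASCII flag this is the set [a-zA-Z0-9_]; for str patterns in Python 3 it also includes Unicode letters and digits by default.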
\b - another word boundary (will also not be included in the match)
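
To see those pieces in action, here is a small sketch. The sample strings are made up for illustration, and the re.ASCII comparison assumes Python 3 str patterns, where \w is Unicode-aware by default.

```python
import re

pattern = r"\b\w+\b"

# Punctuation and whitespace are never part of a match; only runs of word characters are.
tokens = re.findall(pattern, "WATER:HELLO. water , HELLO")
print(tokens)  # ['WATER', 'HELLO', 'water', 'HELLO']

# Digits and underscores count as word characters; apostrophes do not,
# so a contraction is split into two matches.
mixed = re.findall(pattern, "snake_case don't x2")
print(mixed)  # ['snake_case', 'don', 't', 'x2']

# By default \w also matches non-ASCII letters; re.ASCII restricts it to [a-zA-Z0-9_].
unicode_words = re.findall(pattern, "café")
ascii_words = re.findall(pattern, "café", flags=re.ASCII)
print(unicode_words, ascii_words)  # ['café'] ['caf']
```

If the text may contain contractions or hyphenated words that should count as a single word, the pattern would need to be adjusted accordingly.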