I'm learning Python on my own, and I've started refactoring Python code to learn new and more efficient ways to write it.
I tried to build word_dict with a dictionary comprehension, but I couldn't find a way to do it. I had two problems with it:
- I tried to reproduce word_dict[word] += 1 in the comprehension using word_dict[word]:=word_dict[word]+1
- I tried to check if word not in word_dict inside the comprehension, and it didn't work.
The dictionary comprehension is:
word_dict = {word_dict[word] := 0 if word not in word_dict else word_dict[word] := word_dict[word] + 1 for word in text_split}
Here is the code. It reads a text and counts the different words in it. If you know a better way to do it, please let me know.
import re

text = "hello Hello, water! WATER:HELLO. water , HELLO"
# clean the text: replace each punctuation character with a space
text_cleaned = re.sub(r':|!|,|\.', " ", text)
# Output 'hello Hello  water  WATER HELLO  water   HELLO'
# create a list without empty elements
text_split = [element for element in text_cleaned.split(' ') if element != '']
# Output ['hello', 'Hello', 'water', 'WATER', 'HELLO', 'water', 'HELLO']
word_dict = {}
for word in text_split:
    if word not in word_dict:
        word_dict[word] = 0
    word_dict[word] += 1
word_dict
# Output {'hello': 1, 'Hello': 1, 'water': 2, 'WATER': 1, 'HELLO': 2}
Right now you're using a regex to remove some undesirable characters, and then you split on whitespace to get a list of words. Why not use a regex to get the words right away? You can also take advantage of collections.Counter to create a dictionary where the keys are words and the associated values are their counts:
import re
from collections import Counter
text = "hello Hello, water! WATER:HELLO. water , HELLO"
pattern = r"\b\w+\b"
print(Counter(re.findall(pattern, text)))
Output:
Counter({'water': 2, 'HELLO': 2, 'hello': 1, 'Hello': 1, 'WATER': 1})
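Counter is a subclass of dict, so the result can be used wherever the original word_dict was. A minimal sketch, assuming the same sample text as above (the names words and counts are only illustrative):

```python
import re
from collections import Counter

text = "hello Hello, water! WATER:HELLO. water , HELLO"
words = re.findall(r"\b\w+\b", text)
counts = Counter(words)

# Look up a single word; missing keys give 0 instead of raising KeyError
print(counts["water"])    # 2
print(counts["missing"])  # 0

# The two most frequent words; ties stay in first-seen order
print(counts.most_common(2))  # [('water', 2), ('HELLO', 2)]

# Convert to a plain dict if a Counter is not wanted
word_dict = dict(counts)
```

Note that the counts are case-sensitive, exactly as in the original loop.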
Here's what the regex pattern is composed of:
\b - a word boundary (zero-width, so it is not included in the match)
\w+ - one or more word characters: [a-zA-Z0-9_] in ASCII mode; for str patterns in Python 3 it also matches Unicode letters and digits unless re.ASCII is set
\b - another word boundary (also not included in the match)
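As for the dictionary comprehension in the question: it cannot work as written, because word_dict is only bound once the whole comprehension has been evaluated, and := can only assign to a plain name, not to a subscript such as word_dict[word]. If a comprehension is still wanted, one possible sketch counts each distinct word with list.count. This rescans the list once per distinct word, so it will likely be slower than Counter on large inputs:

```python
import re

text = "hello Hello, water! WATER:HELLO. water , HELLO"
words = re.findall(r"\b\w+\b", text)

# dict.fromkeys(words) yields each distinct word once, in first-seen order
word_dict = {word: words.count(word) for word in dict.fromkeys(words)}
print(word_dict)
# {'hello': 1, 'Hello': 1, 'water': 2, 'WATER': 1, 'HELLO': 2}
```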