I am trying to find one or more strings in each line of a file and count how many times each string occurs in the file overall. Some lines contain only one of the strings, but other lines may contain several of the target strings. I am trying to use a regular expression to do this.
Here is what I've tried (having already read the file and split it into lines with .readlines()):
    count1 = 0
    count2 = 0
    count3 = 0
    Pattern = r'(?i)(\bString1\b)|(\bString2\b)|(\bString3\b)'
    i = 0
    while i != len(lines):
        match = re.search(Pattern, lines[i])
        if match:
            if match.group(1):
                count1 = count1 + 1
            elif match.group(2):
                count2 = count2 + 1
            elif match.group(3):
                count3 = count3 + 1
        i = i + 1
This works when a line contains only one match, but when there are several it only counts the first match and then moves on. Is there a way to scan the whole line? I know re.findall finds all matches, but it puts them into a list, and I don't know how I would reliably count the matches for each word, since the matches would sit at different indexes in the list on each pass through the loop.
In your example, the matches are all static strings, so you can just use them as dictionary keys for a Counter object.
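    import re
    from collections import Counter

    count = Counter()
    for line in lines:
        for match in re.finditer(Pattern, line):
            # Count the whole matched string; update() on a bare string
            # would count its individual characters instead.
            count[match.group(0)] += 1
    for k in count:
        print(f"{count[k]} occurrences of {k}")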
Part of the useful change here is using re.finditer() instead of re.findall(). It yields proper re.Match objects, from which you can extract the matching string with .group(0), as well as various other attributes, should you wish to.
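To make the difference concrete, here is a small sketch comparing what the two functions hand back. The sample line and the non-capturing form of the pattern are made up for the demonstration and are not from the question.

```python
import re

# Hypothetical sample data, for illustration only.
pattern = r"(?i)\b(?:String1|String2|String3)\b"
line = "String2 then string1 and String2 again"

# re.findall returns plain strings (or tuples, if the pattern has several groups).
found = re.findall(pattern, line)

# re.finditer yields re.Match objects, which also carry position information.
details = [(m.group(0), m.span()) for m in re.finditer(pattern, line)]

print(found)
print(details)
```

Note that with the (?i) flag the matched text keeps whatever case appeared in the line, so you may want to normalize it (for example with .lower()) before using it as a key.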
If you need to extract matches which could contain variations, like r"c[ei]*ling" or r"\d+", you can't use the matched strings as dictionary keys, because then the Counter would count each unique string as a separate entity: you would get "12 occurrences of 123" and "1 occurrence of 234" instead of "13 occurrences of \d+". In that case, I would perhaps try to use named subgroups.
    for match in re.finditer(r"(?P<ceiling>c[ei]*ling)|(?P<number>\d+)", line):
        matches = match.groupdict()
        for key in matches:
            if matches[key] is not None:
                # Count the group name itself, not its characters.
                count[key] += 1
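For completeness, here is a self-contained sketch of the named-group approach that can be run as-is. The sample lines are invented for the demonstration, and it uses match.lastgroup as a shortcut: when each alternative is a single top-level named group, lastgroup should be the name of the alternative that matched, which avoids looping over groupdict().

```python
import re
from collections import Counter

# Hypothetical sample lines, for illustration only.
lines = ["ceiling 123 cling", "no match here", "234 celing"]

pattern = re.compile(r"(?P<ceiling>c[ei]*ling)|(?P<number>\d+)")

count = Counter()
for line in lines:
    for match in pattern.finditer(line):
        # lastgroup is the name of the named alternative that matched.
        count[match.lastgroup] += 1

for name, total in count.items():
    print(f"{total} occurrences of {name}")
```

This shortcut only holds while the named groups are not nested inside one another; with nested groups, the groupdict() loop is the safer choice.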