I wrote a program to extract the numbers from a file and calculate their sum. However, some numbers are missing from the result, so both the count and the sum come out wrong. I have had trouble finding the right regex pattern to extract the integers from this file, and I have tried many ways to fix it.
File: http://py4e-data.dr-chuck.net/regex_sum_42.txt
import re
name = open('Sample data.txt')
sum = 0
count = 0
for line in name:
    line = line.rstrip()
    if line.isdigit():
        y2 = re.findall('[0-9]+',line)
        sum = sum + int(y2[0])
        count = count + 1
    else:
        continue
print(y2,sum,count)
Desired result: there are 90 values with a sum of 445833.
Your main issue is in this line:
if line.isdigit():
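which prevents anything from happening unless every character in `line` is a digit. A line of text with numbers embedded in it fails this test, so it is skipped entirely. You don't actually need this check, as your regex match will ensure you only find numeric values in each line.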
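To see why the `isdigit()` check drops most lines, here is a small sketch. The sample strings are made up for illustration and are not taken from the data file.

```python
# Hypothetical sample lines: text with embedded numbers, digits only, and empty.
samples = ["Why should you learn 3 things in 42 days?", "7746", ""]

# str.isdigit() is True only when the string is non-empty and all digits.
results = [s.isdigit() for s in samples]
print(results)  # [False, True, False]
```

Only the second string would get past the `if`, even though the first one contains two numbers.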
Your other issue is the use of `sum` as a variable name, as that will prevent use of the built-in function `sum`, which is useful for this problem. Change it to something like `total` instead.
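A minimal sketch of the shadowing problem follows. The function names `broken_total` and `working_total` are hypothetical and exist only for this demonstration.

```python
def broken_total(values):
    sum = 0  # this local name hides the built-in sum() inside the function
    try:
        return sum(values)  # calls the integer 0, not the built-in
    except TypeError as exc:
        return f"TypeError: {exc}"


def working_total(values):
    total = 0  # a different name leaves the built-in sum() usable
    total += sum(values)
    return total


print(broken_total([1, 2, 3]))
print(working_total([1, 2, 3]))  # 6
```

The first call reports a `TypeError` because an integer is not callable; the second works because the accumulator no longer reuses the built-in's name.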
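This code should do what you want. Unlike `int(y2[0])`, which adds only the first match on a line, it counts and sums every number that `re.findall` returns: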
import re

total = 0
count = 0
with open('Sample data.txt') as file:
    for line in file:
        # every run of digits on the line, as a list of strings
        y2 = re.findall(r'\d+', line)
        total += sum(map(int, y2))
        count += len(y2)
print(f'{count} values summing to {total}')
For your sample text file, this gives:
90 values summing to 445833
as desired.
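If you want to check the logic without the data file, one option is to wrap the same loop in a function that accepts any iterable of lines. This is only a sketch: the name `count_and_sum` and the demo lines are assumptions for illustration.

```python
import re


def count_and_sum(lines):
    """Return (count, total) for every run of digits found in the given lines."""
    total = 0
    count = 0
    for line in lines:
        numbers = re.findall(r'\d+', line)
        total += sum(map(int, numbers))
        count += len(numbers)
    return count, total


# Made-up lines, not the real file contents.
demo = ["3 apples and 4 pears", "no numbers here", "42"]
print(count_and_sum(demo))  # (3, 49)
```

An open file object is also an iterable of lines, so it can be passed to the function directly.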
Note that if you need to deal with (possibly) signed numbers, you should change the regex to
[+-]?\d+
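A quick sketch of the signed pattern on a made-up line. Note that the pattern also treats a hyphen directly before digits as a minus sign, which may or may not be what you want for text like `10-4`.

```python
import re

# Hypothetical input, not from the data file.
line = "offsets -5 and +12, then 7"
matches = re.findall(r'[+-]?\d+', line)
print(matches)                 # ['-5', '+12', '7']
print(sum(map(int, matches)))  # 14

# A hyphen between two numbers is read as the sign of the second one.
ranged = re.findall(r'[+-]?\d+', "10-4")
print(ranged)                  # ['10', '-4']
```

`int()` accepts a leading `+` or `-`, so the matches convert without extra handling.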