python-3.x regex python-re

Extract all the numbers in a file and compute the sum of the numbers


I am trying to extract all the numbers from a file and compute their sum. I wrote the program below, but some numbers are missing from the total, so both the count and the sum come out wrong. I have tried several approaches, but I have not found a regex pattern that extracts every integer from the file.

File: http://py4e-data.dr-chuck.net/regex_sum_42.txt

import re
name = open('Sample data.txt')
sum = 0
count = 0
for line in name:
    line = line.rstrip()
    if line.isdigit():
        y2 = re.findall('[0-9]+',line)
        sum = sum + int(y2[0])
        count = count + 1
    else:
        continue
print(y2,sum,count)

What I tried:

  1. Opened and read the file provided.
  2. Imported the re module from the Python standard library.
  3. Initialized a running sum and added to it, but the result was wrong.
  4. Extracted the integers from the file using the regex pattern [0-9]+.
  5. Kept a count of the values being summed, but it was also wrong.
  6. The pattern [0-9]+ should match every run of digits. However, the numbers can appear anywhere in a line, and my program missed some of them.

Desired result: there are 90 values with a sum of 445833.


Solution

  • Your main issue is in this line:

    if line.isdigit():
    

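    which prevents anything happening unless every character in line is a digit. You don't actually need this check, as your regex match already ensures you only find numeric values in each line. Note also that int(y2[0]) uses only the first match, so any further numbers on the same line are dropped.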
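
    To see the difference between the two checks, here is a small sketch. The sample lines are made up for illustration and are not taken from your file; it only compares what str.isdigit() and re.findall report for each line.

```python
import re

# Hypothetical sample lines, invented for this illustration.
lines = [
    "Why should you learn to write programs? 7746",
    "12 1929 8827",
    "42",
    "no numbers here",
]

# Pair up the isdigit() verdict with the regex matches for each line.
results = [(line.isdigit(), re.findall(r'[0-9]+', line)) for line in lines]

for line, (all_digits, matches) in zip(lines, results):
    print(repr(line), all_digits, matches)
```

    Only the line "42" passes isdigit(), because even a space between numbers makes it return False, while findall picks up the numbers in the first three lines and returns an empty list for the last one.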

    Your other issue is the use of sum as a variable name: it shadows the built-in function sum, which is useful for this problem. Change it to something like total instead.

    This code should do what you want:

    import re

    total = 0
    count = 0
    with open('Sample data.txt') as file:
        for line in file:
            # findall returns every run of digits in the line (possibly none)
            y2 = re.findall(r'\d+', line)
            total += sum(map(int, y2))
            count += len(y2)

    print(f'{count} values summing to {total}')
    

    For your sample text file, this gives:

    90 values summing to 445833
    

    as desired.
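
    If the file is small enough to read in one go, a variant is to run findall over the whole text rather than line by line. This is only a sketch: sum_numbers is a name I made up, and the sample string is invented so the snippet runs without your file.

```python
import re

def sum_numbers(text):
    """Return (count, total) for every run of digits found in text."""
    numbers = [int(n) for n in re.findall(r'\d+', text)]
    return len(numbers), sum(numbers)

# Invented sample text standing in for the contents of the file.
sample = "3036 and 7209\nno digits on this line\n4497 6702 x 8454\n"
count, total = sum_numbers(sample)
print(f'{count} values summing to {total}')  # 5 values summing to 29898
```

    For a real file you would pass it the result of file.read() inside the same with block as above.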

    Note that if you need to deal with (possibly) signed numbers, you should change the regex to

    [+-]?\d+
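
    As a quick illustration of what the optional sign changes, here is a sketch on a made-up string (not from your file). int() accepts a leading + or -, so the matches can be converted directly.

```python
import re

# Invented text containing signed and unsigned integers.
text = "gain +15, loss -7, flat 3"

unsigned = re.findall(r'\d+', text)       # signs are ignored
signed = re.findall(r'[+-]?\d+', text)    # an optional sign is kept with the digits

print(unsigned, sum(map(int, unsigned)))  # ['15', '7', '3'] 25
print(signed, sum(map(int, signed)))      # ['+15', '-7', '3'] 11
```

    One caveat: with the signed pattern, a hyphen used as a separator (for example in a date like 2021-05) is also read as a minus sign, so only use it if that matches your data.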