Tags: python, sum, text-files

How can I sum integers from multiple text files into a new text file using Python?


I am currently trying to take ten different text files (file2_0.txt, file2_1.txt, file2_2.txt, ...), each containing a single column of one hundred million random integers, and add them together row by row. I want to add every row from all ten files and generate a new text file (total_file.txt) containing the sum of each row. Below is an example of what I am trying to do, using two of the files added together to create total_file.txt.

file2_0.txt

5
19
51
10
756

file2_1.txt

11
43
845
43
156

total_file.txt

16
62
896
53
912

Since these files are rather large, I am trying not to read them into memory, and instead want to use concurrency. I found sample code in another Stack Overflow question (Python : Sum of numbers in different files) that I was trying out with two files before doing all of the files at one time. The problem I am having is that the output (total_file.txt) only contains the numbers from the second text file (file2_1.txt), with nothing added. I am not sure why this is. I am new to Stack Overflow and to coding in general, and wanted to ask about this on the linked post; however, I read that this is not good practice. Below is the code I have worked on.

import shutil
#Files to add
filenames = ['file2_0.txt', 'file2_1.txt']
sums = []

with open('file2_0.txt') as file:
    for row in file:
        sums.append(row.split())
#Create output file
with open('total_file.txt', 'wb') as wfd:
    for file in filenames:
        with open(file) as open_file:
            for i, row in enumerate(open_file):
                sums[i] = sums[i]+row.split()

    with open(file, 'rb') as fd:
        shutil.copyfileobj(fd, wfd)

Just for background: I am working with these large files to test processing speeds. Once I understand what I am doing wrong here, I will move on to parallel processing, specifically multithreading, to compare the various processing speeds. Please let me know what further information you might need from me.


Solution

  • Quick diagnosis first: total_file.txt contains only file2_1.txt's numbers because sums is never written to the output at all. The final shutil.copyfileobj call simply copies the last file in the loop into total_file.txt verbatim. Note also that row.split() returns a list of strings, so sums[i] + row.split() concatenates lists rather than adding integers.

    For the actual task, I'd use generators so you don't have to load all of the files into memory at once (in case they're large).

    Then just pull the next value from each generator, sum them, write the result, and carry on. When you hit the end of a file you'll get a StopIteration exception and be done.

    def read_file(file):
        # Yield one line at a time so the whole file is never held in memory
        with open(file, "r") as inFile:
            for row in inFile:
                yield row

    # file2_0.txt ... file2_9.txt, the names given in the question
    file_list = [f"file2_{i}.txt" for i in range(10)]
    file_generators = [read_file(path) for path in file_list]

    with open("totals.txt", "w+") as outFile:
        while True:
            try:
                # A list comprehension (not a generator expression) is used so that
                # StopIteration raised by next() propagates out to the except clause
                outFile.write(f"{sum([int(next(gen)) for gen in file_generators])}\n")
            except StopIteration:
                # One of the files is exhausted, so every row has been summed
                break
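
    If you'd rather not write the generator yourself, here is a minimal alternative sketch (assuming the same file2_0.txt ... file2_9.txt names from the question): file objects are already lazy line iterators, so zip can pull one row from each file in lockstep and stops on its own at the end of the shortest file, while contextlib.ExitStack guarantees every input gets closed.

    from contextlib import ExitStack

    file_list = [f"file2_{i}.txt" for i in range(10)]

    with ExitStack() as stack, open("totals.txt", "w") as outFile:
        # Open every input file; ExitStack closes them all on exit
        files = [stack.enter_context(open(path)) for path in file_list]
        # zip yields one tuple of lines per row, stopping at the shortest file
        for rows in zip(*files):
            outFile.write(f"{sum(int(r) for r in rows)}\n")

    Either way, the work stays row-by-row and memory use stays flat no matter how large the files are.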