I am currently trying to take ten different text files (file2_0.txt, file2_1.txt, file2_2.txt, ...), each containing one column of one hundred million random integers, and sum them row by row. I want to add every row across all ten files together and generate a new text file (total_file.txt) containing the sum of each row. Below is an example of what I am trying to do using two of the files added together to create total_file.txt.
file2_0.txt
5
19
51
10
756
file2_1.txt
11
43
845
43
156
total_file.txt
16
62
896
53
912
Since these files are rather large, I am trying to avoid reading them into memory and want to use concurrency instead. I found sample code in another Stack Overflow question (Python : Sum of numbers in different files) that I was trying out with two files before doing all of the files at one time. The problem I am having is that the output (total_file.txt) only contains the numbers from the second text file (file2_1.txt), with nothing added. I am not sure why this is. I am new to Stack Overflow and to coding in general, and wanted to ask about this on the linked post; however, I read online that that is not good practice. Below is the code I worked on.
import shutil

#Files to add
filenames = ['file2_0.txt', 'file2_1.txt']
sums = []
with open('file2_0.txt') as file:
    for row in file:
        sums.append(row.split())

#Create output file
with open('total_file.txt', 'wb') as wfd:
    for file in filenames:
        with open(file) as open_file:
            for i, row in enumerate(open_file):
                sums[i] = sums[i]+row.split()
        with open(file, 'rb') as fd:
            shutil.copyfileobj(fd, wfd)
Just for background, I am working with these large files to test processing speeds. Once I understand what I am doing wrong, I will be working on parallel processing, specifically multithreading, to compare the various processing speeds. Please let me know what further information you might need from me.
I'd use generators so you don't have to load all of the files into memory at once (in case they're large). Then just pull the next value from each generator, sum them, write the sum, and carry on. When you hit the end of a file you'll get a StopIteration exception and be done.
def read_file(file):
    with open(file, "r") as inFile:
        for row in inFile:
            yield row

file_list = ["file1.txt", "file2.txt", ..., "file10.txt"]
file_generators = [read_file(path) for path in file_list]

with open("totals.txt", "w+") as outFile:
    while True:
        try:
            outFile.write(f"{sum([int(next(gen)) for gen in file_generators])}\n")
        except StopIteration:
            break
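As an alternative sketch of the same idea: zip already pulls one line from every open file per iteration and stops at the shortest file, so it can replace the explicit generator/StopIteration handling. The sample data and file names below are made up for illustration; swap in your ten file2_*.txt files.

    from contextlib import ExitStack

    # Hypothetical two-file sample inputs, matching the example in the question.
    with open("file1.txt", "w") as f:
        f.write("5\n19\n51\n")
    with open("file2.txt", "w") as f:
        f.write("11\n43\n845\n")

    file_list = ["file1.txt", "file2.txt"]

    # ExitStack keeps all input files open for the duration of the loop
    # and closes every one of them when the block exits.
    with ExitStack() as stack:
        in_files = [stack.enter_context(open(path)) for path in file_list]
        with open("totals.txt", "w") as out_file:
            # zip yields one tuple of lines (one line per file) at a time,
            # so only a single row from each file is in memory at once.
            for rows in zip(*in_files):
                out_file.write(f"{sum(int(row) for row in rows)}\n")

This still streams the files line by line rather than loading them, and because the loop ends naturally when the shortest file runs out, there is no try/except needed.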