I'm doing a lot of calculations writing the results to a file. Using multiprocessing I'm trying to parallelise the calculations.
Problem here is that I'm writing to one output file, which all the workers are writing too. I'm quite new to multiprocessing, and wondering how I could make it work.
A very simple concept of the code is given below:
from multiprocessing import Pool
fout_=open('test'+'.txt','w')
def f(x):
fout_.write(str(x) + "\n")
if __name__ == '__main__':
p = Pool(5)
p.map(f, [1, 2, 3])
The result I want would be a file with:
1 2 3
However now I get an empty file. Any suggestions? I greatly appreciate any help :)!
You shouldn't be letting all the workers/processes write to a single file. They can all read from one file (which may cause slow downs due to workers waiting for one of them to finish reading), but writing to the same file will cause conflicts and potentially corruption.
Like said in the comments, write to separate files instead and then combine them into one on a single process. This small program illustrates it based on the program in your post:
from multiprocessing import Pool
def f(args):
''' Perform computation and write
to separate file for each '''
x = args[0]
fname = args[1]
with open(fname, 'w') as fout:
fout.write(str(x) + "\n")
def fcombine(orig, dest):
''' Combine files with names in
orig into one file named dest '''
with open(dest, 'w') as fout:
for o in orig:
with open(o, 'r') as fin:
for line in fin:
fout.write(line)
if __name__ == '__main__':
# Each sublist is a combination
# of arguments - number and temporary output
# file name
x = range(1,4)
names = ['temp_' + str(y) + '.txt' for y in x]
args = list(zip(x,names))
p = Pool(3)
p.map(f, args)
p.close()
p.join()
fcombine(names, 'final.txt')
It runs f
for each argument combination which in this case are value of x and temporary file name. It uses a nested list of argument combinations since pool.map
does not accept more than one arguments. There are other way to go around this, especially on newer Python versions.
For each argument combination and pool member it creates a separate file to which it writes the output. In principle your output will be longer, you can simply add another function that computes it to the f
function. Also, no need to use Pool(5) for 3 arguments (though I assume that only three workers were active anyway).
Reasons for calling close()
and join()
are explained well in this post. It turns out (in the comment to the linked post) that map
is blocking, so here you don't need them for the original reasons (wait till they all finish and then write to the combined output file from just one process). I would still use them in case other parallel features are added later.
In the last step, fcombine
gathers and copies all the temporary files into one. It's a bit too nested, if you for instance decide to remove the temporary file after copying, you may want to use a separate function under the with open('dest', )..
or the for loop underneath - for readability and functionality.