Tags: python, numpy, variables, time, multiprocessing

Creating plots with multiprocessing and time.strftime() doesn't work properly


I am trying to create plots with a script that runs in parallel using multiprocessing. I created two example scripts for this question, because the actual main script with the computing part would be too long. script0.py contains the multiprocessing part, which starts script1.py four times in parallel. In this example, script1.py just creates some random scatter plots.

script0.py:

import multiprocessing as mp
import os

def execute(process):
    os.system(f"python {process}")



if __name__ == "__main__":

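    proc_num = 4
    # One entry per worker; each worker runs script1.py once.
    process = ["script1.py"] * proc_num

    # The context manager shuts the pool down once map() has returned.
    with mp.Pool(processes=proc_num) as process_pool:
        process_pool.map(execute, process)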

script1.py:

# Just a random scatter plot, but it works for my example.
import time
import numpy as np
import matplotlib.pyplot as plt
import os

dir_name = "stackoverflow_question"
plot_name = time.strftime("Plot %Hh%Mm%Ss")      # note the time.strftime() function

# exist_ok=True avoids an error when several parallel instances create the directory at once
os.makedirs(dir_name, exist_ok=True)

N = 50
x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)

area = (30 * np.random.rand(N))**2

plt.scatter(x, y, s=area, c=colors, alpha=0.5)
#plt.show()
plt.savefig(f"{dir_name}/{plot_name}", dpi=300)

The important thing is that I am naming each plot by its creation time:

plot_name = time.strftime("Plot %Hh%Mm%Ss")

This creates a string like "Plot 16h39m22s". So far so good. Now to my actual problem: when the processes start in parallel, the plot names are sometimes identical, because time.strftime() only has one-second resolution and returns the same timestamp in each of them. As a result, one instance of script1.py can overwrite the plot already created by another.

In my working script, where I have this exact problem, I'm generating a lot of data, so I need to name my plots and CSVs according to the date and time they were generated.

I already thought of passing a variable down to script1.py when it gets called, but I don't know how to do that, since I have only just learned about the multiprocessing library. This variable would also have to vary between calls; otherwise I would run into the same problem.

Does anybody have a better idea of how I could do this? Thank you so much in advance.


Solution

  • I propose these approaches:

    • Approach 1 (simple and recommended): if you can change the name, use Unix time (e.g. time.time() or time.time_ns()) instead of the formatted time, or add fractional seconds to the timestamp. This makes a collision almost impossible.
    • Approach 2: add the process ID to the filename (e.g. <filename_timestamp_processid>). Even if two processes write at the same time, the process ID distinguishes the files. If you want to remove the ID from the names at the end of execution, read the filenames and merge them; where names collide, adjust the filename appropriately.
    • Approach 3: like Approach 2, but instead of changing the name, create a folder named after the process ID and put that process's outputs in it. At the end of execution, merge the folders and resolve any collisions.
    • Approach 4 (not recommended; difficult to manage and affects performance): shared memory. Keep the last timestamp in a variable in shared memory and check each new timestamp against it.
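
As a minimal sketch of how Approaches 1 and 2 could be combined in script1.py: the helper name unique_plot_name and the exact name layout below are only illustrative assumptions, not part of the original scripts. It keeps the readable "Plot 16h39m22s" prefix from the question and appends the sub-second part of time.time_ns() and the process ID from os.getpid().

```python
import os
import time


def unique_plot_name(prefix="Plot"):
    """Return a name built from the local time, the sub-second part, and the PID.

    Hypothetical helper for illustration; the layout of the name is an assumption.
    """
    now_ns = time.time_ns()                       # Unix time in nanoseconds
    seconds, frac_ns = divmod(now_ns, 1_000_000_000)
    # Same human-readable stamp as in the question, e.g. "16h39m22s"
    stamp = time.strftime("%Hh%Mm%Ss", time.localtime(seconds))
    # Zero-padded nanoseconds (Approach 1) plus the process ID (Approach 2)
    return f"{prefix} {stamp}_{frac_ns:09d}_pid{os.getpid()}"


if __name__ == "__main__":
    print(unique_plot_name())
```

In script1.py this would stand in for the time.strftime() line, e.g. plot_name = unique_plot_name(). Each instance started by script0.py is a separate operating-system process, so instances running at the same moment have different process IDs. The timestamp stays in the name because the operating system can reuse a process ID after a process has exited.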