Tags: python, python-3.x, linux, multiprocessing, python-multiprocessing

How to add data to a JSON file while making use of multiprocessing?


I'm using TinyDB, a user-friendly, JSON-based, document-oriented database. But I can't insert multiple pieces of data into my database because I'm using multiprocessing: after a while I get an error that id x already exists in the database (because two or more processes are trying to insert data at the same time). Is there any way to solve this?

On every run I insert new, unique params.

Example params:

params = {'id': 1, 'name': 'poop', 'age': 99}

Code:

from tinydb import TinyDB
from multiprocessing import Process

resultsDb = TinyDB('db/resultsDb.json')

def run(params):
    resultsDb.insert({'id': params['id'], 'name': params['name'], 'age': params['age']})

maxProcesses = 12  # Cores in my pc

processes = []
for i in range(maxProcesses):
    processes.append(Process(target=run, args=(params,)))

for p in processes:
    p.start()

for p in processes:
    p.join()

Solution

  • I could not test this on Linux because the only Linux system I have access to is a shared server on which the facilities required to run this code are forbidden. The version below is for Windows, but the key features are:

    1. It uses a Lock to ensure that the insertions are serialized, which I believe is necessary for the code to run without error. This, of course, defeats the purpose of parallelizing: with every insert serialized, there is really no point in using multiprocessing or multithreading for this. (If the serialized writes are the only real work, a single-writer design is an alternative; see the sketch after the output below.)
    2. On Windows I did not have to move the resultsDb = TinyDB('db/resultsDb.json') statement into the run function, because on platforms that use spawn to create new processes, such as Windows, a statement at global scope is re-executed in every newly created process anyway. On Linux, however, where fork is used, the statement would not be re-executed for each new process; instead, each new process would inherit the single database handle opened by the main process. That might or might not work, so try it both ways, with the statement at global scope or inside run. If you do put it back at global scope, you do not need the duplicate statement towards the bottom of the source.
    from tinydb import TinyDB
    from multiprocessing import Process, Lock
    
    
    def run(lock, params):
        resultsDb = TinyDB('db/resultsDb.json')
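        # The lock serializes TinyDB's read-modify-write of the JSON file,
        # so two processes can never interleave their inserts.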
        with lock:
            resultsDb.insert({'id': params['id'], 'name': params['name'], 'age': params['age']})
        print('Successfully inserted.')
    
    # required by Windows:
    if __name__ == '__main__':
        params = {'id': 1, 'name': 'poop', 'age': 99}
    
        maxProcesses = 12 # Cores in my pc
    
        lock = Lock()
        processes = []
        for i in range(maxProcesses):
            processes.append(Process(target=run, args=(lock, params)))
    
        for p in processes:
            p.start()
    
        for p in processes:
            p.join()
    
        # remove the following if the first one is at global scope:
        resultsDb = TinyDB('db/resultsDb.json')
        print(resultsDb.all())
    

    Prints:

    Successfully inserted.
    Successfully inserted.
    Successfully inserted.
    Successfully inserted.
    Successfully inserted.
    Successfully inserted.
    Successfully inserted.
    Successfully inserted.
    Successfully inserted.
    Successfully inserted.
    Successfully inserted.
    Successfully inserted.
    [{'id': 1, 'name': 'poop', 'age': 99}, {'id': 1, 'name': 'poop', 'age': 99}, {'id': 1, 'name': 'poop', 'age': 99}, {'id': 1, 'name': 'poop', 'age': 99}, {'id': 1, 'name': 'poop', 'age': 99}, {'id': 1, 'name': 'poop', 'age': 99}, {'id': 1, 'name': 'poop', 'age': 99}, {'id': 1, 'name': 'poop', 'age': 99}, {'id': 1, 'name': 'poop', 'age': 99}, {'id': 1, 'name': 'poop', 'age': 99}, {'id': 1, 'name': 'poop', 'age': 99}, {'id': 1, 'name': 'poop', 'age': 99}]
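
    Since the lock serializes every insert anyway, a variation worth trying is to make the main process the only writer: the workers only produce records and push them onto a multiprocessing.Queue, and the main process drains the queue into TinyDB, so no lock is needed at all. This is only a sketch of that idea, not part of the answer above; it assumes the same db/resultsDb.json path and params shape.

    from tinydb import TinyDB
    from multiprocessing import Process, Queue


    def worker(queue, params):
        # Workers never touch the database file; they only produce records.
        queue.put({'id': params['id'], 'name': params['name'], 'age': params['age']})


    if __name__ == '__main__':
        params = {'id': 1, 'name': 'poop', 'age': 99}
        maxProcesses = 12  # Cores in my pc

        queue = Queue()
        processes = [Process(target=worker, args=(queue, params))
                     for _ in range(maxProcesses)]
        for p in processes:
            p.start()

        # The main process is the single writer, so no lock is required.
        # Drain the queue before joining so no worker blocks on a full pipe.
        resultsDb = TinyDB('db/resultsDb.json')
        for _ in range(maxProcesses):
            resultsDb.insert(queue.get())

        for p in processes:
            p.join()

        print(resultsDb.all())

    On Linux you can also force spawn semantics for testing by calling multiprocessing.set_start_method('spawn') at the top of the __main__ guard.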