
Shared memory dictionary creation too slow using multiprocessing.Manager()


I have code that needs to read an Excel file and store the information in dictionaries.

I have to use multiprocessing.Manager() to create the dictionaries so that I can retrieve calculation output from a function that I run using multiprocessing.Process.

The problem is that creating a dictionary with multiprocessing.Manager() and manager.dict() takes ~400 times longer than using a plain dict() (which is not a shared-memory structure).

Here is sample code to verify the difference:

import xlrd
import multiprocessing
import time

def DictManager(inp1, inp2):
    manager = multiprocessing.Manager()
    Dict = manager.dict()
    Dict['input1'] = inp1
    Dict['input2'] = inp2
    Dict['Output1'] = None
    Dict['Output2'] = None
    return Dict

def DictNoManager(inp1, inp2):
    Dict = dict()
    Dict['input1'] = inp1
    Dict['input2'] = inp2
    Dict['Output1'] = None
    Dict['Output2'] = None
    return Dict

def ReadFileManager(excelfile):
    DictList = []
    book = xlrd.open_workbook(excelfile)
    sheet = book.sheet_by_index(0)
    for line in range(2,sheet.nrows):
        inp1 = sheet.cell(line,2).value
        inp2 = sheet.cell(line,3).value
        dictionary = DictManager(inp1, inp2)
        DictList.append(dictionary)
    print('Done!')

def ReadFileNoManager(excelfile):
    DictList = []
    book = xlrd.open_workbook(excelfile)
    sheet = book.sheet_by_index(0)
    for line in range(2,sheet.nrows):
        inp1 = sheet.cell(line,2).value
        inp2 = sheet.cell(line,3).value
        dictionary = DictNoManager(inp1, inp2)
        DictList.append(dictionary)
    print('Done!')


if __name__ == '__main__':
    excelfile = 'MyFile.xlsx'

    start = time.time()
    ReadFileNoManager(excelfile)
    end = time.time()
    print('Run time NoManager:', end - start, 's')

    start = time.time()
    ReadFileManager(excelfile)
    end = time.time()
    print('Run time Manager:', end - start, 's')

Is there a way to improve the performance of multiprocessing.Manager()?

If the answer is No, is there any other shared memory structure that I can use to replace what I am doing and improve performance?

I would appreciate your help!

EDIT:

My main function uses the following code:

def MyFunction(Dictionary, otherdata):
    # Perform calculation and save results in the dictionary
    Dictionary['Output1'] = Value1
    Dictionary['Output2'] = Value2

ListOfProcesses = []
for Dict in DictList:
    p = multiprocessing.Process(target=MyFunction, args=(Dict, otherdata))
    p.start()
    ListOfProcesses.append(p)  
for p in ListOfProcesses:
    p.join()

If I do not use the manager, I will not be able to retrieve the Outputs.
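
For reference, here is a minimal sketch (not part of my original code, and the value is made up) of why a plain dict does not work: each child process operates on its own copy of the dictionary, so writes made in the child never reach the parent.

import multiprocessing

def worker(d):
    d['Output1'] = 42  # modifies the child's copy only

if __name__ == '__main__':
    plain = {'Output1': None}
    p = multiprocessing.Process(target=worker, args=(plain,))
    p.start()
    p.join()
    print(plain['Output1'])  # prints None: the parent's dict is unchanged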


Solution

  • As I mentioned in the comments, I recommend using the main process to read in the Excel file, then using multiprocessing for the function calls. Just add your function to apply_function and make sure it returns whatever you want; results will contain a list of your results.

    Update: I changed map to starmap to include your extra argument.

    import xlrd
    import multiprocessing
    from itertools import repeat

    def ReadFileNoManager(excelfile):
        DictList = []
        book = xlrd.open_workbook(excelfile)
        sheet = book.sheet_by_index(0)
        for line in range(2,sheet.nrows):
            inp1 = sheet.cell(line,2).value
            inp2 = sheet.cell(line,3).value
            dictionary = DictNoManager(inp1, inp2)
            DictList.append(dictionary)
        print('Done!')
        return DictList
    
    def apply_function(your_dict, otherdata):
        # Perform your calculation here and return whatever you need
        pass
    
    if __name__ == '__main__':
        excelfile = 'MyFile.xlsx'
        otherdata = None  # placeholder: define your extra argument here
        dict_list = ReadFileNoManager(excelfile)
        pool = multiprocessing.Pool(multiprocessing.cpu_count())
        results = pool.starmap(apply_function, zip(dict_list, repeat(otherdata)))
        pool.close()
        pool.join()
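
    For illustration, a hypothetical body for apply_function might look like this (the assignments are made-up placeholders; the point is that each worker returns a plain dict, so the pool collects all outputs in results without any shared memory):

    def apply_function(your_dict, otherdata):
        # Hypothetical calculation, just to show the pattern:
        # fill in the output keys and return the dict
        your_dict['Output1'] = your_dict['input1']  # placeholder value
        your_dict['Output2'] = your_dict['input2']  # placeholder value
        return your_dict

    Because the dicts travel to the workers as arguments and come back as return values, no Manager is needed, which avoids the per-operation IPC overhead that made your original version slow.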