Tags: python, python-multiprocessing

Processing list of dictionaries via python multiprocessing


I'm processing a list of dictionaries in Python like so:

def process_results(list_of_dicts):
    first_result, second_result, count = [], [], 0
    for dictionary in list_of_dicts:
        first_result.append(dictionary)
        if 'pi' in dictionary:
            second_result.append(dictionary)
        count += 1
    print(second_result, first_result)

Next, following this simple SO example of using multiprocessing in a for loop, I tried the following (with completely erroneous results):

    from multiprocessing import Pool

    def process_results(list_of_dicts):
        first_result, second_result, count = [], [], 0
        for dictionary in list_of_dicts:
            first_result.append(dictionary)
            if 'pi' in dictionary:
                second_result.append(dictionary)
            count += 1
        return second_result, first_result

    if __name__ == '__main__':
        list_of_dictionaries = # a list of dictionaries
        pool = Pool()
        print(pool.map(process_results, list_of_dictionaries))

Why is this wrong? An illustrative example would be nice.


Solution

  • What you're probably looking for is this:

    from multiprocessing import Pool
    
    def process_results(single_dict):
        # each call receives one dictionary, not the whole list
        first_result, second_result = [], []
        first_result.append(single_dict)
        if 'pi' in single_dict:
            second_result.append(single_dict)
        return first_result, second_result
    
    if __name__ == '__main__':
        lst_dict = [{'a':1, 'b':2, 'c':3}, {'c':4, 'pi':3.14}, {'pi':'3.14', 'not pi':8.3143}, {'sin(pi)': 0, 'cos(pi)': 1}]
        pool = Pool()
        print(pool.map(process_results, lst_dict))
    

    pool.map executes process_results for each element in the iterable lst_dict. Since lst_dict is a list of dictionaries, process_results is called once per dictionary, with that single dictionary as its argument. Each call therefore processes one dictionary rather than the whole list.
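    The per-element behaviour of pool.map can be seen in a minimal sketch (the square function and its inputs are made up for illustration):

```python
from multiprocessing import Pool

def square(n):
    # pool.map calls this once per element of the iterable
    return n * n

if __name__ == '__main__':
    with Pool() as pool:
        # equivalent to [square(1), square(2), square(3)],
        # but spread over worker processes; order is preserved
        print(pool.map(square, [1, 2, 3]))  # [1, 4, 9]
```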

    process_results in this program is changed accordingly: for a given dictionary, it appends the dictionary to the first_result list, and also appends it to the second_result list if the 'pi' key exists. The return value is a tuple of two lists: the first contains the dictionary, and the second contains either a copy of the first or nothing, if no 'pi' key was found.

    All this can be modified if, for instance, you need the first_result and second_result lists to be shared among processes.
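    If you did need shared lists, one common approach (a sketch only, not part of the original answer; the collect helper and sample data are made up) is a multiprocessing.Manager:

```python
from functools import partial
from multiprocessing import Manager, Pool

def collect(shared_first, shared_second, single_dict):
    # append into manager-backed lists visible to all worker processes;
    # note the append order across workers is not guaranteed
    shared_first.append(single_dict)
    if 'pi' in single_dict:
        shared_second.append(single_dict)

if __name__ == '__main__':
    with Manager() as manager:
        first, second = manager.list(), manager.list()
        with Pool() as pool:
            pool.map(partial(collect, first, second), [{'a': 1}, {'pi': 3.14}])
        print(list(first), list(second))
```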

    For a better picture of how pool.map() works, look at the first example in the documentation.

    To retrieve the results in their original/target form of two lists, collect the data into a list and then process it:

    results = pool.map(process_results, lst_dict)

    first_result = [i[0][0] for i in results]
    second_result = [i[1][0] for i in results if i[1]]
    

    results is a list of tuples. Each tuple is the result of processing one dictionary: the first element is a list holding the whole dictionary, and the second is either an empty list or a list holding the same dictionary, if the 'pi' key was found. The remaining two lines extract that data into the first_result and second_result lists.
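    Putting the pieces together, a self-contained sketch (the sample dictionaries here are made up):

```python
from multiprocessing import Pool

def process_results(single_dict):
    # runs once per dictionary, in a worker process
    first_result, second_result = [], []
    first_result.append(single_dict)
    if 'pi' in single_dict:
        second_result.append(single_dict)
    return first_result, second_result

if __name__ == '__main__':
    lst_dict = [{'a': 1, 'b': 2}, {'pi': 3.14}]
    with Pool() as pool:
        results = pool.map(process_results, lst_dict)
    first_result = [i[0][0] for i in results]           # every dictionary
    second_result = [i[1][0] for i in results if i[1]]  # only dictionaries with 'pi'
    print(first_result)   # [{'a': 1, 'b': 2}, {'pi': 3.14}]
    print(second_result)  # [{'pi': 3.14}]
```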