
Python multiprocessing file read takes too much time


There is a function in my code that reads files; each file is about 8 MB, but the reading speed is too low. To improve it I used multiprocessing, but sadly it seems to get blocked. Is there any way to fix this and improve the reading speed?

My code is as follows:

import multiprocessing as mp
import json
import os

def gainOneFile(filename):
    file_from = open(filename)
    json_str = file_from.read()
    temp = json.loads(json_str)
    print "load:", filename, " len ", len(temp)
    file_from.close()
    return temp

def gainSortedArr(path):
    arr = []
    pool = mp.Pool(4)
    for i in xrange(1,40):
        abs_from_filename = os.path.join(path, "outputDict"+str(i))
        result = pool.apply_async(gainOneFile,(abs_from_filename,)) 
        arr.append(result.get())

    pool.close()
    pool.join()                                               
    arr = sorted(arr,key = lambda dic:len(dic))

    return arr

And the calling code:

whole_arr = gainSortedArr("sortKeyOut/")  

Solution

  • You have a few problems. First, you're not parallelizing. You do:

    result = pool.apply_async(gainOneFile,(abs_from_filename,)) 
    arr.append(result.get())
    

    over and over: you dispatch a task, then immediately call .get(), which waits for it to complete before you dispatch any additional tasks, so you never actually have more than one worker running at once. Store all the AsyncResult objects without calling .get(), then call .get() on them later (a sketch of that follows the example below). Or just use Pool.map or a related method and save yourself the hassle of managing individual results, e.g. (using imap_unordered to minimize overhead, since you're just sorting anyway):

    # Make generator of paths to load
    paths = (os.path.join(path, "outputDict"+str(i)) for i in xrange(1, 40))
    # Load them all in parallel, and sort the results by length (lambda is redundant)
    arr = sorted(pool.imap_unordered(gainOneFile, paths), key=len)
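
    If you'd rather keep apply_async, here's a minimal sketch of the first alternative: dispatch every task up front, then collect the results only afterwards. It reuses the names from your code and changes nothing else:

    results = []
    for i in xrange(1, 40):
        abs_from_filename = os.path.join(path, "outputDict" + str(i))
        # apply_async returns immediately; the work runs in the pool
        results.append(pool.apply_async(gainOneFile, (abs_from_filename,)))
    pool.close()
    pool.join()
    # Only block on .get() once every task has been dispatched
    arr = sorted((r.get() for r in results), key=len)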
    

    Second, multiprocessing has to pickle and unpickle all arguments and return values sent between the main process and the workers, and it's all sent over pipes that incur system call overhead to boot. Since your file system isn't likely to gain substantial speed from parallelizing the reads, it's likely to be a net loss, not a gain.

    You might be able to get a bit of a boost by switching to a thread based pool; change the import to import multiprocessing.dummy as mp and you'll get a version of Pool implemented in terms of threads; they don't work around the CPython GIL, but since this code is almost certainly I/O bound, that hardly matters, and it removes the pickling and unpickling as well as the IPC involved in worker communications.
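
    Here's a minimal sketch of that thread-based variant, folding in the imap_unordered change from above; only the import really differs from what you already have:

    import multiprocessing.dummy as mp  # same Pool API, backed by threads
    import json
    import os

    def gainOneFile(filename):
        # json.load reads and parses in one step
        with open(filename) as file_from:
            return json.load(file_from)

    def gainSortedArr(path):
        pool = mp.Pool(4)  # four threads; the work is I/O bound, so the GIL barely matters
        try:
            paths = (os.path.join(path, "outputDict" + str(i)) for i in xrange(1, 40))
            return sorted(pool.imap_unordered(gainOneFile, paths), key=len)
        finally:
            pool.close()
            pool.join()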

    Lastly, if you're using Python 3.3 or higher on a UNIX-like system, you may be able to get the OS to help you out by having it pull the files into the system cache more aggressively. If you can open the file, call os.posix_fadvise on its file descriptor (.fileno() on file objects) with either POSIX_FADV_WILLNEED or POSIX_FADV_SEQUENTIAL; that may improve read performance when you read from the file later, by prefetching the data before you request it.
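
    A minimal sketch of that hint, assuming Python 3.3+ on a POSIX system (the prefetch helper name is just for illustration):

    import os

    def prefetch(filename):
        # Ask the OS to read the whole file into the page cache ahead of time;
        # offset=0, length=0 means "from the start through the end of the file"
        with open(filename, 'rb') as f:
            os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_WILLNEED)

    Calling prefetch on each path shortly before you parse it lets the kernel start reading the data in the background, so the later read is more likely to be served from the cache.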