Tags: python, python-2.7, nested-loops, python-multiprocessing

How to parallelize this nested loop in Python


I am trying to improve the performance of my code and can't figure out how to use the multiprocessing module in it.

I am using Linux (CentOS 7.2) and Python 2.7.

The code that I need to run in a parallel environment:

import os
import sys

def start_fetching(directory):
    with open("test.txt", "a") as myfile:
        try:
            for dirpath, dirnames, filenames in os.walk(directory):
                for current_file in filenames:
                    current_file = os.path.join(dirpath, current_file)
                    myfile.write(current_file + "\n")  # one path per line
            return 0
        except:
            return sys.exc_info()[0]

if __name__ == "__main__":
    cwd = "/home/"
    final_status = start_fetching(cwd)
    sys.exit(final_status)

I need to save the meta-data of all the files (here, only the filename is shown) in a database. For now I am only storing the file names in a text file.
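For context, by meta-data I mean the fields that os.stat exposes (size, owner, group, and the three timestamps); a minimal sketch of reading them (the path is just an example):

import os

st = os.stat("/etc/hostname")                  # any example path
print st.st_size, st.st_uid, st.st_gid         # size, owner, group
print st.st_atime, st.st_mtime, st.st_ctime    # access/modify/change times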


Solution

  • Thank you all for helping me reduce this script's processing time to almost half. (I am adding this as an answer since I can't fit this much content in a comment.)

    I found two ways to achieve what I wished for:

    1. Using this link mentioned by @KeerthanaPrabhakaran, which is concerned with multi-threading:

      import contextlib
      import itertools
      import multiprocessing
      import os
      import subprocess

      def worker(filename):
          subprocess_out = subprocess.Popen(["stat", "-c",
                                     "INSERT INTO file VALUES (NULL, \"%n\", '%F', %s, %u, %g, datetime(%X, 'unixepoch', 'localtime'), datetime(%Y, 'unixepoch', 'localtime'), datetime(%Z, 'unixepoch', 'localtime'));", filename], stdout=subprocess.PIPE)
          return subprocess_out.communicate()[0]

      def start_fetching(directory, threads):
          filename = fetch_filename() + ".txt"  # fetch_filename() is a helper of mine, defined elsewhere
          with contextlib.closing(multiprocessing.Pool(threads)) as pool:   # pool of worker processes
              with open(filename, "a") as myfile:
                  walk = os.walk(directory)
                  fn_gen = itertools.chain.from_iterable((os.path.join(root, file) for file in files) for root, dirs, files in walk)

                  results_of_work = pool.map(worker, fn_gen)  # this does the parallel processing
                  print "Concatenating the result into the text file"
                  for result in results_of_work:
                      myfile.write(str(result))
          return filename
      

      This traverses 15203 files in 0m15.154s.
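      For completeness, this is roughly how I invoke it (the pool size is a free parameter; multiprocessing.cpu_count() is a sensible default):

      if __name__ == "__main__":
          out_file = start_fetching("/home/", multiprocessing.cpu_count())
          print "Results written to %s" % out_file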

    2. The second one, which @ArunKumar mentioned, is based on multiprocessing:

      import multiprocessing
      import os
      import subprocess
      import sys

      def task(filename, process_no, return_dict):
          subprocess_out = subprocess.Popen(["stat", "-c",
                                     "INSERT INTO file VALUES (NULL, \"%n\", '%F', %s, %u, %g, datetime(%X, 'unixepoch', 'localtime'), datetime(%Y, 'unixepoch', 'localtime'), datetime(%Z, 'unixepoch', 'localtime'));",
                                     filename], stdout=subprocess.PIPE)
          return_dict[process_no] = subprocess_out.communicate()[0]


      def start_fetching_1(directory):
          try:
              processes = []
              i = 0
              manager = multiprocessing.Manager()
              return_dict = manager.dict()

              for dirpath, dirnames, filenames in os.walk(directory):
                  for current_file in filenames:
                      current_file = os.path.join(dirpath, current_file)
                      # Create a separate process per file, because multi-threading won't help with parallelizing
                      p = multiprocessing.Process(target=task, args=(current_file, i, return_dict))
                      i += 1
                      p.start()
                      processes.append(p)

              # Let all the child processes finish and do some post-processing if needed.
              for process in processes:
                  process.join()

              with open("test.txt", "a") as myfile:
                  myfile.write("".join(return_dict.values()))  # values() is a list, so join before writing

              return 0
          except:
              return sys.exc_info()[0]
      

      This traverses 15203 files in 1m12.197s.

    I don't understand why the multiprocessing version takes that much time (my initial code took only 0m27.884s) while utilizing almost 100% CPU. Presumably spawning one child process per file (15203 of them) costs far more in fork and start-up overhead than it gains in parallelism.

    The above code is exactly what I am running (I store this info in a file and then use the test.txt file to create the database entries).
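    Since test.txt ends up holding one INSERT statement per file, turning it into database entries is a short step. Assuming the target is SQLite (which the datetime(..., 'unixepoch', 'localtime') syntax in the stat template suggests), a sketch using executescript; it assumes the file table already exists, and the database filename is just an example:

      import sqlite3

      def load_into_db(sql_file, db_file="files.db"):  # db_file name is illustrative
          conn = sqlite3.connect(db_file)
          with open(sql_file) as f:
              conn.executescript(f.read())  # executes every INSERT in the file
          conn.commit()
          conn.close()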

    I am trying to optimize the above code further, but can't think of a better way; as @CongMa mentioned, it might have finally come down to an I/O bottleneck.
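    One variation I have not benchmarked yet: build the INSERT row with os.stat inside the worker instead of forking a stat process for every file, so each file costs a couple of system calls rather than a child process. A rough sketch (untested; the file-type column %F is left out for brevity):

      import datetime
      import os

      def fmt_time(epoch):
          # mirrors datetime(..., 'unixepoch', 'localtime') in the stat template
          return datetime.datetime.fromtimestamp(epoch).strftime("%Y-%m-%d %H:%M:%S")

      def worker_stat(filename):
          st = os.stat(filename)
          return "INSERT INTO file VALUES (NULL, \"%s\", %d, %d, %d, '%s', '%s', '%s');\n" % (
              filename, st.st_size, st.st_uid, st.st_gid,
              fmt_time(st.st_atime), fmt_time(st.st_mtime), fmt_time(st.st_ctime))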