
Darknet multithreading with pexpect


I'm trying to do parallel processing of 4 images using Darknet and pexpect. The current implementation is similar to the test below. It takes 70 ms to detect one image, while 300 ms is needed to detect 4 in parallel. Am I doing it wrong, or do I need more than 1 GPU for this processing?

import logging
import threading
import time
import unittest

import pexpect

import config  # project-local settings: OBJ_DATA, YOLOV3, WEIGHTS


class Darknet:
    def __init__(self):
        self.instance = pexpect.spawn(f'darknet detector test {config.OBJ_DATA} {config.YOLOV3} {config.WEIGHTS} -ext_output -dont_show')
        self.instance.delaybeforesend = None
        self.instance.delayafterread = None
        self.instance.expect('Enter Image Path:')
        self.available = True

    def process(self, image_name):
        self.instance.sendline(image_name)
        self.instance.expect('milli-seconds.', timeout=600)
        self.instance.expect('Enter Image Path:', timeout=600)
        output = self.instance.before
        logging.info(output)

        return output


class PexpectPerformanceTest(unittest.TestCase):
    def test_pexpect_speed(self):
        image_path = "/app/tmp/training_set/125_20231113_100730_609_3.jpg"
        darknet_instance_1 = SimpleDarknetThread(Darknet(), image_path)
        darknet_instance_2 = SimpleDarknetThread(Darknet(), image_path)
        darknet_instance_3 = SimpleDarknetThread(Darknet(), image_path)
        darknet_instance_4 = SimpleDarknetThread(Darknet(), image_path)

        darknet_instance_1.start()
        darknet_instance_2.start()
        darknet_instance_3.start()
        darknet_instance_4.start()

        darknet_instance_1.join()
        darknet_instance_2.join()
        darknet_instance_3.join()
        darknet_instance_4.join()

        print(f"Darknet 1 output: {darknet_instance_1.output}")
        print(f"Darknet 2 output: {darknet_instance_2.output}")
        print(f"Darknet 3 output: {darknet_instance_3.output}")
        print(f"Darknet 4 output: {darknet_instance_4.output}")


class SimpleDarknetThread(threading.Thread):
    def __init__(self, darknet, image_path):
        super().__init__()
        self.darknet = darknet
        self.image_path = image_path
        self.output = None

    def run(self):
        start_time = time.time()
        self.output = self.darknet.process(self.image_path)
        print(f"Required time: {(time.time() - start_time):.2f} s")


if __name__ == "__main__":
    unittest.main()

Solution

  • I have zero knowledge when it comes to "pexpect". From the code you posted, it seems you are spawning multiple CLI instances of Darknet.

    The main issue is that the thing that takes the longest to run is loading the weights into the GPU. And now you've multiplied that time by 4!

    The other issue is that loading the weights consumes a lot of vram, which exists in limited quantity. Depending on the configuration you are using, and the dimensions for which you've trained, you may not have enough vram to load 4 independent copies of your neural network at once.

    Assuming the vram problem isn't an issue, let me show you why loading 4 copies at once to process exactly 1 image each won't help.

    When either Darknet or DarkHelp runs, it outputs the length of time it takes to load the neural network. In my case, I'll load the usual 80-class MSCOCO network which displays the following on the GPU hardware I have:

    Done! Loaded 162 layers from weights-file 
    -> loading network took 1335.175 milliseconds
    

    Processing an image then takes 53.691 additional milliseconds:

    #1/1: loading "artwork/dog.jpg"
    -> prediction took 53.691 milliseconds
    

    So if all the images are the same dimensions, processing should take more-or-less the same amount of time. Meaning 54 milliseconds each. Total time to load this network and process 4 images is 1335 + 54 * 4 = 1551 milliseconds.

    Now if I do this in 4 calls to the Darknet or DarkHelp CLI, we're multiplying the load time by 4, giving us (1335 + 54) * 4 = 5556 milliseconds.
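
    That comparison can be written out as a quick sanity check. The 1335 ms and 54 ms figures are the timings measured above on my hardware; yours will differ:

    ```python
    # Timings measured above: network load and per-image prediction, in ms.
    LOAD_MS = 1335
    PREDICT_MS = 54
    IMAGES = 4

    # Strategy 1: load the network once, then run all 4 predictions.
    load_once = LOAD_MS + PREDICT_MS * IMAGES

    # Strategy 2: spawn 4 CLI instances, each paying the full load cost.
    load_per_image = (LOAD_MS + PREDICT_MS) * IMAGES

    print(load_once, load_per_image)  # 1551 5556
    ```

    The per-image prediction work is the same in both cases; the only thing the second strategy buys you is three extra copies of the load time.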

    So obviously, loading multiple copies of Darknet or DarkHelp, and processing 1 image in each instance is not the right solution.

    Instead, if you are using python, what you should be doing is loading the weights once. Then, make repeated calls to process as many images as you need. Both Darknet and DarkHelp have a python API.

    The API should be easy enough to use given the example.py file. For example:

    # Load the network once (this is the slow part).
    dh = DarkHelp.CreateDarkHelpNN(cfg_filename, names_filename, weights_filename)
    DarkHelp.SetThreshold(dh, 0.35)             # detection threshold
    DarkHelp.SetAnnotationLineThickness(dh, 1)  # annotation line width, in pixels
    DarkHelp.PredictFN(dh, "page_1.png".encode("utf-8"))  # predict by filename
    json = DarkHelp.GetPredictionResults(dh)    # results as a JSON string
    

    So the idea is you'd call Predict() or PredictFN() followed by GetPredictionResults() as many times as you have images. But in each case, you'd only load the weights once, even if you're processing thousands of images.
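
    That loop might look like the following sketch. The `DarkHelp` calls are the same bindings shown in the snippet above; `detect_all()` is a hypothetical helper name, not part of the DarkHelp API:

    ```python
    import json as jsonlib  # avoid shadowing the `json` variable used above

    def detect_all(DarkHelp, dh, image_paths):
        """Re-use one already-loaded network handle `dh` for every image,
        calling PredictFN() + GetPredictionResults() once per image."""
        results = []
        for path in image_paths:
            DarkHelp.PredictFN(dh, path.encode("utf-8"))
            # GetPredictionResults() returns the detections as a JSON string.
            results.append(jsonlib.loads(DarkHelp.GetPredictionResults(dh)))
        return results
    ```

    Each iteration pays only the ~54 ms prediction cost, never the load cost.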

    Now if you want to get fancy and you have enough vram on your GPU, you could instantiate multiple Darknet or DarkHelp objects, each of which loads the weights. You can then use each one in parallel to process multiple images at once. But the thing you need to watch is your vram usage. Running nvidia-smi will tell you how much vram each instance is using and whether you have room to instantiate another copy.
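
    A minimal sketch of that idea, assuming each detector instance exposes a process(image_path) method like the Darknet class in the question, and making sure a single instance is never used by two threads at once:

    ```python
    import queue
    import threading

    def parallel_detect(instances, image_paths):
        """Distribute images across several already-loaded detector instances.
        Each instance is driven by exactly one worker thread, so a single
        pexpect/DarkHelp handle is never shared concurrently.  Sketch only --
        check vram with nvidia-smi before creating more instances."""
        work = queue.Queue()
        for i, path in enumerate(image_paths):
            work.put((i, path))
        results = [None] * len(image_paths)

        def worker(inst):
            # Pull images until the queue is drained.
            while True:
                try:
                    i, path = work.get_nowait()
                except queue.Empty:
                    return
                results[i] = inst.process(path)

        threads = [threading.Thread(target=worker, args=(inst,)) for inst in instances]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return results
    ```

    With this layout the number of loaded copies (and thus the vram cost) is fixed by how many instances you create, no matter how many images you feed in.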

    Other performance hints are in the YOLO FAQ: https://www.ccoderun.ca/programming/yolo_faq/#fps



    EDIT:

    Take a look at DarkHelp's DHThreads. It loads multiple copies of the same neural network onto the GPU (or into memory, if using the CPU-only build of Darknet). These copies can then be used in parallel to process many image files or video frames at the same time.