
Darknet multithreading with pexpect


I'm trying to do parallel processing of 4 images using Darknet and pexpect. The current implementation is similar to the test below. It takes 70 ms to detect one image, while 300 ms is needed to detect 4 in parallel. Am I doing it wrong, or do I need more than 1 GPU for this processing?

import logging
import threading
import time
import unittest

import pexpect

import config  # project-local settings: OBJ_DATA, YOLOV3, WEIGHTS


class Darknet:
    def __init__(self):
        self.instance = pexpect.spawn(f'darknet detector test {config.OBJ_DATA} {config.YOLOV3} {config.WEIGHTS} -ext_output -dont_show')
        self.instance.delaybeforesend = None
        self.instance.delayafterread = None
        self.instance.expect('Enter Image Path:')
        self.available = True

    def process(self, image_name):
        self.instance.sendline(image_name)
        self.instance.expect('milli-seconds.', timeout=600)
        self.instance.expect('Enter Image Path:', timeout=600)
        output = self.instance.before
        logging.info(output)

        return output


class PexpectPerformanceTest(unittest.TestCase):
    def test_pexpect_speed(self):
        image_path = "/app/tmp/training_set/125_20231113_100730_609_3.jpg"
        darknet_instance_1 = SimpleDarknetThread(Darknet(), image_path)
        darknet_instance_2 = SimpleDarknetThread(Darknet(), image_path)
        darknet_instance_3 = SimpleDarknetThread(Darknet(), image_path)
        darknet_instance_4 = SimpleDarknetThread(Darknet(), image_path)

        darknet_instance_1.start()
        darknet_instance_2.start()
        darknet_instance_3.start()
        darknet_instance_4.start()

        darknet_instance_1.join()
        darknet_instance_2.join()
        darknet_instance_3.join()
        darknet_instance_4.join()

        print(f"Darknet 1 output: {darknet_instance_1.output}")
        print(f"Darknet 2 output: {darknet_instance_2.output}")
        print(f"Darknet 3 output: {darknet_instance_3.output}")
        print(f"Darknet 4 output: {darknet_instance_4.output}")


class SimpleDarknetThread(threading.Thread):
    def __init__(self, darknet, image_path):
        super().__init__()
        self.darknet = darknet
        self.image_path = image_path
        self.output = None

    def run(self):
        start_time = time.time()
        self.output = self.darknet.process(self.image_path)
        print(f"Required time: {(time.time() - start_time):.2f} s")


if __name__ == "__main__":
    unittest.main()

Solution

  • I have zero knowledge when it comes to "pexpect". From the code you posted, it seems you are spawning multiple CLI instances of Darknet.

    The main issue is that the thing that takes the longest to run is loading the weights into the GPU. And now you've multiplied that time by 4!

    The other issue is that loading the weights consumes a lot of vram, which exists in limited quantity. Depending on the configuration you are using, and the dimensions for which you've trained, you may not have enough vram to load 4 independent copies of your neural network at once.

    Assuming the vram problem isn't an issue, let me show you why loading 4 copies at once to process exactly 1 image each won't help.

    When either Darknet or DarkHelp runs, it outputs the length of time it takes to load the neural network. In my case, I'll load the usual 80-class MSCOCO network which displays the following on the GPU hardware I have:

    Done! Loaded 162 layers from weights-file 
    -> loading network took 1335.175 milliseconds
    

    Processing an image then takes 53.691 additional milliseconds:

    #1/1: loading "artwork/dog.jpg"
    -> prediction took 53.691 milliseconds
    

    So if all the images are the same dimensions, processing should take more-or-less the same amount of time. Meaning 54 milliseconds each. Total time to load this network and process 4 images is 1335 + 54 * 4 = 1551 milliseconds.

    Now if I do this in 4 calls to the Darknet or DarkHelp CLI, we're multiplying the load time by 4, giving us (1335 + 54) * 4 = 5556 milliseconds.
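
    That comparison can be written out as a quick sanity check. The 1335 ms and 54 ms figures are the timings measured above on my hardware; yours will differ:

    ```python
    # Timings measured above: network load and per-image prediction, in ms.
    LOAD_MS = 1335
    PREDICT_MS = 54
    IMAGES = 4

    # Strategy 1: load the network once, then run all 4 predictions.
    load_once = LOAD_MS + PREDICT_MS * IMAGES

    # Strategy 2: spawn 4 CLI instances, each paying the full load cost.
    load_per_image = (LOAD_MS + PREDICT_MS) * IMAGES

    print(load_once, load_per_image)  # 1551 5556
    ```

    The per-image prediction work is the same in both cases; the only thing the second strategy buys you is three extra copies of the load time.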

    So obviously, loading multiple copies of Darknet or DarkHelp, and processing 1 image in each instance is not the right solution.

    Instead, if you are using python, what you should be doing is loading the weights once. Then, make repeated calls to process as many images as you need. Both Darknet and DarkHelp have a python API.

    The API should be easy enough to use given the example.py file. For example:

    # Load the network once (this is the slow part).
    dh = DarkHelp.CreateDarkHelpNN(cfg_filename, names_filename, weights_filename)
    DarkHelp.SetThreshold(dh, 0.35)             # detection threshold
    DarkHelp.SetAnnotationLineThickness(dh, 1)  # annotation line width, in pixels
    DarkHelp.PredictFN(dh, "page_1.png".encode("utf-8"))  # predict by filename
    json = DarkHelp.GetPredictionResults(dh)    # results as a JSON string
    

    So the idea is you'd call Predict() or PredictFN() followed by GetPredictionResults() as many times as you have images. But in each case, you'd only load the weights once, even if you're processing thousands of images.
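
    That loop might look like the following sketch. The `DarkHelp` calls are the same bindings shown in the snippet above; `detect_all()` is a hypothetical helper name, not part of the DarkHelp API:

    ```python
    import json as jsonlib  # avoid shadowing the `json` variable used above

    def detect_all(DarkHelp, dh, image_paths):
        """Re-use one already-loaded network handle `dh` for every image,
        calling PredictFN() + GetPredictionResults() once per image."""
        results = []
        for path in image_paths:
            DarkHelp.PredictFN(dh, path.encode("utf-8"))
            # GetPredictionResults() returns the detections as a JSON string.
            results.append(jsonlib.loads(DarkHelp.GetPredictionResults(dh)))
        return results
    ```

    Each iteration pays only the ~54 ms prediction cost, never the load cost.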

    Now if you want to get fancy and you have enough vram on your GPU, you could instantiate multiple Darknet or DarkHelp objects, each of which loads the weights. You can then use each one in parallel to process multiple images at once. But the thing you need to watch is your vram usage. Running nvidia-smi will tell you how much vram each instance is using and whether you have room to instantiate another copy.
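
    A minimal sketch of that idea, assuming each detector instance exposes a process(image_path) method like the Darknet class in the question, and making sure a single instance is never used by two threads at once:

    ```python
    import queue
    import threading

    def parallel_detect(instances, image_paths):
        """Distribute images across several already-loaded detector instances.
        Each instance is driven by exactly one worker thread, so a single
        pexpect/DarkHelp handle is never shared concurrently.  Sketch only --
        check vram with nvidia-smi before creating more instances."""
        work = queue.Queue()
        for i, path in enumerate(image_paths):
            work.put((i, path))
        results = [None] * len(image_paths)

        def worker(inst):
            # Pull images until the queue is drained.
            while True:
                try:
                    i, path = work.get_nowait()
                except queue.Empty:
                    return
                results[i] = inst.process(path)

        threads = [threading.Thread(target=worker, args=(inst,)) for inst in instances]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return results
    ```

    With this layout the number of loaded copies (and thus the vram cost) is fixed by how many instances you create, no matter how many images you feed in.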

    Other performance hints are in the YOLO FAQ: https://www.ccoderun.ca/programming/yolo_faq/#fps



    EDIT:

    Take a look at DarkHelp's DHThreads. It loads multiple copies of the same neural network onto the GPU (or into memory, if using the CPU-only build of Darknet). These copies can then be used in parallel to process many image files or video frames at the same time.