startTime = time.time()
blob = cv2.dnn.blobFromImage(img, float(1.0/255.0), (frameWidth,frameHeight), (0,0,0), swapRB = True, crop = False)
yolo.setInput(blob)
layerOutput = yolo.forward(outputLayers)
endTime = time.time()
This is the Python code whose time I am measuring.
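As a side note, a single timed call can be misleading: the first forward pass often pays one-off initialization costs, so warming up and averaging gives steadier numbers. A minimal sketch using only the standard library; the placeholder lambda stands in for a call such as `yolo.forward(outputLayers)`:

```python
import time

def benchmark(fn, warmup=3, runs=10):
    """Time fn() after discarding warm-up calls; returns mean ms per run."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) * 1000.0 / runs

# Placeholder workload; in the question this would wrap the forward pass,
# e.g. benchmark(lambda: yolo.forward(outputLayers))
print(benchmark(lambda: sum(range(10000))))
```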
auto start = chrono::steady_clock::now();
blob = blobFromImage(images[i], 1.0f/255.0f, Size(frameWidth, frameHeight), Scalar(0,0,0), true, false);
net.setInput(blob);
net.forward(layerOutput, getOutputsNames(net));
auto end = chrono::steady_clock::now();
This is the C++ code whose time I am measuring.
In C++: blob is of type Mat, layerOutput is of type vector<Mat>, and getOutputsNames returns the layer names as a vector<string>.
In Python: blob is of type numpy.ndarray, layerOutput is of type tuple, and outputLayers is a list.
Both runs use the same backend and target (backend: OpenCV, target: CPU), and the same YOLOv4 weight and config files from the same directories.
When measuring the time, the forward pass takes ~180-200 ms in Python but ~220-250 ms in C++. Since C++ is a compiled language, I expected it to run at least as fast as Python, which surprisingly is not the case.
What might be the reason that Python runs faster than C++ here, and what would you suggest to fix it?
Thanks in advance!
I figured out what the problem was. I had built a custom OpenCV for C++ to take advantage of the CUDA cores on my Jetson Orin, while Python was using a stock OpenCV installed in another directory, without CUDA support. In that custom build I had also changed the CPU parallelization settings, which turned out to be slower than the defaults. When I switched the C++ side to the stock OpenCV build, it ran as fast as expected.
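For anyone hitting the same mismatch: a quick way to check which OpenCV installation Python is actually loading, and whether that build reports CUDA support, is `cv2.__file__` together with `cv2.getBuildInformation()`. A small sketch (the `cuda_enabled` helper is my own name, not an OpenCV API; the guarded import lets the helper run even where OpenCV is absent):

```python
def cuda_enabled(build_info: str) -> bool:
    """True if the cv2.getBuildInformation() text reports a CUDA: YES line."""
    for line in build_info.splitlines():
        if "CUDA:" in line:
            return "YES" in line
    return False

try:
    import cv2
    print(cv2.__file__)                           # which installation got imported
    print(cuda_enabled(cv2.getBuildInformation()))  # does that build have CUDA?
except ImportError:
    pass  # OpenCV not installed in this environment
```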