c++ · object-detection · mediapipe · tflite · palm

Mediapipe palm detection model outputs


I want to add MediaPipe hand landmark detection to my C++ project, but MediaPipe doesn't support CMake, so I had to find another way. I found that hand landmark detection runs two models in series: the first is palm detection and the second is landmark detection. From the MediaPipe website I got both models. They are TFLite models, so loading them shouldn't be difficult, but I'm having trouble figuring out how to convert the palm detection output to bboxes. The model gives me two outputs, one with shape (2016, 18) and a second with shape (2016,). The first one should be

[number of anchors, 18]

indices 0–3 are the bounding box offset, width, and height: dx, dy, w, h

indices 4–17 are the 7 hand keypoints' x and y coordinates: x1, y1, x2, y2, ..., x7, y7

The second output should be the confidence score for each bbox.

(2016, 18)[0] ---> [-3896.9226 5079.4067 6987.4683 7181.9116 992.45654 4032.2664 -7006.974 -2635.5786 -4408.5684 -3171.507 -2381.8406 -3177.1763 -1996.8119 -2633.921 2559.212 5521.417 4017.0728 4059.862 ]

(2016,)[0] ---> -2090.7869
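For clarity, here's how I'm reading the two output buffers in C++; the names `raw_boxes` and `raw_scores` are mine, just for illustration, and the layout is the one described above:

```cpp
// Illustrative only: how I index the two output tensors.
// raw_boxes  : pointer to the (2016, 18) float output
// raw_scores : pointer to the (2016,)   float output
constexpr int kNumAnchors = 2016;
constexpr int kValuesPerAnchor = 18;

void InspectOutputs(const float* raw_boxes, const float* raw_scores) {
    for (int i = 0; i < kNumAnchors; ++i) {
        const float* row = raw_boxes + i * kValuesPerAnchor;
        float dx = row[0], dy = row[1], w = row[2], h = row[3];
        // row[4] .. row[17]: x1, y1, ..., x7, y7 keypoint coordinates
        float raw_score = raw_scores[i];  // one raw value per anchor
        (void)dx; (void)dy; (void)w; (void)h; (void)raw_score;
    }
}
```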

Could you please help me figure out the math needed to end up with bboxes?

During my research, I found the same problem discussed at https://github.com/google/mediapipe/issues/3751 and in https://github.com/aashish2000/hand_tracking, but I couldn't understand how to end up with the bboxes.


Solution

  • The main steps needed to convert the MediaPipe palm model's output to a rectangle are explained in the repo terryky/tflite_gles_app.git. It uses the old models, but the main steps are the same (see the sketch below). In my repo, hand-landmarks-cpp.git, I made the changes necessary to run the new models, both the palm detection model and the hand landmark model; you can find the source code there. I noticed that the Python version runs at least 3x faster than the C++ version, probably because of lag in the TensorFlow Lite C++ API, or maybe I have a bug in my code, who knows :)
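In short: generate the SSD anchors, decode each raw box relative to its anchor, squash the raw scores with a sigmoid, then run non-max suppression. Below is a minimal C++ sketch of the decode step as I understand it from terryky's repo and MediaPipe's TfLiteTensorsToDetectionsCalculator. It assumes the anchors were already generated the way MediaPipe's SsdAnchorsCalculator does (strides {8, 16, 16, 16} giving 2016 anchors for the 192x192 full model), and that the x/y/w/h scale equals the model input size; the struct and function names are mine, not MediaPipe's:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Assumed anchor layout: normalized center and size, as produced by
// MediaPipe's SSD anchor generation (with fixed_anchor_size, w = h = 1).
struct Anchor { float x_center, y_center, w, h; };

struct Detection {
    float xmin, ymin, xmax, ymax;  // normalized [0, 1] coordinates
    float score;
    float keypoints[14];           // 7 (x, y) palm keypoints
};

// Decode the raw (2016, 18) regressors and (2016,) logits into boxes.
// input_size is the model input resolution (e.g. 192 for the full model);
// MediaPipe uses it as the x/y/w/h scale when decoding.
std::vector<Detection> DecodePalms(const float* raw_boxes,
                                   const float* raw_scores,
                                   const std::vector<Anchor>& anchors,
                                   float input_size,
                                   float score_threshold = 0.5f) {
    std::vector<Detection> out;
    for (size_t i = 0; i < anchors.size(); ++i) {
        // The raw score is a logit: clamp, then apply a sigmoid.
        float logit = std::min(std::max(raw_scores[i], -100.f), 100.f);
        float score = 1.f / (1.f + std::exp(-logit));
        if (score < score_threshold) continue;

        const float* p = raw_boxes + i * 18;
        const Anchor& a = anchors[i];
        // Offsets are in input pixels relative to the anchor center;
        // dividing by input_size converts them to normalized coordinates.
        float cx = p[0] / input_size * a.w + a.x_center;
        float cy = p[1] / input_size * a.h + a.y_center;
        float w  = p[2] / input_size * a.w;
        float h  = p[3] / input_size * a.h;

        Detection d;
        d.xmin = cx - w * 0.5f;
        d.ymin = cy - h * 0.5f;
        d.xmax = cx + w * 0.5f;
        d.ymax = cy + h * 0.5f;
        d.score = score;
        for (int k = 0; k < 7; ++k) {
            d.keypoints[2 * k]     = p[4 + 2 * k]     / input_size * a.w + a.x_center;
            d.keypoints[2 * k + 1] = p[4 + 2 * k + 1] / input_size * a.h + a.y_center;
        }
        out.push_back(d);
    }
    // The surviving boxes still overlap heavily; sort by score so a
    // non-max-suppression pass (MediaPipe uses a weighted variant) can
    // be run on the result afterwards.
    std::sort(out.begin(), out.end(),
              [](const Detection& l, const Detection& r) { return l.score > r.score; });
    return out;
}
```

Since the palm model's anchors use fixed_anchor_size (w = h = 1), the multiplications by a.w and a.h are effectively no-ops here; I kept them so the formula matches MediaPipe's general decode.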