Tags: c++, opencv, gstreamer, video-processing, nvidia-jetson-nano

How to reduce delay in my gstreamer video pipeline reading from a Jetson Nano?


I have a program that captures a live video feed from a CSI camera connected to a Jetson Nano and estimates the pose of an ArUco marker. However, when I move the marker in real life, the position returned by my code only changes about 0.5-1 s later. I am very new to all of this and I can't figure out where the bottleneck is.

My code is below. I measured the average time it takes for the main loop (where the frames are analysed) to complete and got around 50-60 ms, so I don't think the detection and pose estimation are the problem themselves.

I commented out cv::imshow("Image", image); to see if the delay was coming from that, but the change in the delay was minimal to none. Since a window is no longer shown, cv::waitKey(1) == 'q' will no longer trigger, which is why I added total_time >= 30 as an alternative exit condition. From what I understand, whatever time I pass to cv::waitKey is how long the program waits before continuing. Taking this into account, I changed the line to cv::waitKey(1000) == 'q', so that the program waits a second before analysing the next frame (effectively setting the fps to 1), to see what the delay would be. With that change, it takes around 12 seconds for any movement in the real world to show up in the output (I uncommented cv::imshow("Image", image); to see which frames were being analysed).

I thought that when the time in cv::waitKey is over, the current frame would be captured, but it seems like it is either capturing an old frame for some reason or the VideoCapture itself has a massive delay.

Can anyone help me figure out what the problem is and any possible ways to fix it? Thanks in advance.
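(To clarify what I mean by "capturing an old frame": a quick way to test whether read() is handing back queued frames instead of live ones would be to time a burst of consecutive reads. At 30 fps a live read should block for roughly 33 ms, while reads served from a backlog return almost immediately. This is only an untested sketch, using the cap object opened exactly as in the full code below.)

cv::Mat probe;
for (int i = 0; i < 10; i++)
{
    double t0 = static_cast<double>(cv::getTickCount());
    if (!cap.read(probe))
        break;
    double dt_ms = 1000.0 * (static_cast<double>(cv::getTickCount()) - t0) / cv::getTickFrequency();
    // Near-zero times here would suggest the frames are coming from a queue, not from the sensor.
    std::cout << "read " << i << ": " << dt_ms << " ms" << std::endl;
}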

#include <opencv2/opencv.hpp>
#include <opencv2/aruco.hpp>
#include <opencv2/objdetect/aruco_detector.hpp>
#include <opencv2/objdetect/aruco_dictionary.hpp>
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc.hpp>

#include <cmath>
#include <iostream>
#include <string>
#include <vector>

std::string gstreamer_pipeline(int capture_width, int capture_height, int display_width, int display_height, int framerate, int flip_method, int sensor_mode)
{
    return "nvarguscamerasrc sensor_mode=" + std::to_string(sensor_mode) + " ! video/x-raw(memory:NVMM), width=(int)" + std::to_string(capture_width) + ", height=(int)" +
           std::to_string(capture_height) + ", framerate=(fraction)" + std::to_string(framerate) +
           "/1 ! nvvidconv flip-method=" + std::to_string(flip_method) + " ! video/x-raw, width=(int)" + std::to_string(display_width) + ", height=(int)" +
           std::to_string(display_height) + ", format=(string)GRAY8 ! videoconvert ! video/x-raw, format=(string)GRAY8 ! appsink";
}

bool loadCameraCalibration(const std::string &filename, cv::Mat &cameraMatrix, cv::Mat &distCoeffs)
{
    cv::FileStorage fs(filename, cv::FileStorage::READ);
    if (!fs.isOpened())
    {
        return false;
    }

    fs["Camera_Matrix"] >> cameraMatrix;
    fs["Distorsion_Coefficients"] >> distCoeffs;
    fs.release();

    return true;
}

void detectCorners(const cv::Mat &image, const cv::aruco::ArucoDetector &detector, std::vector<int> &ids, std::vector<std::vector<cv::Point2f>> &corners)
{
    // Image is already grayscale.

    // Detect the markers in the grayscale image.
    std::vector<std::vector<cv::Point2f>> rejectedCorners;
    detector.detectMarkers(image, corners, ids);

    // Refine the corners using cornerSubPix.
    for (auto &corner : corners)
    {
        cv::cornerSubPix(image, corner, cv::Size(3, 3), cv::Size(-1, -1), cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 30, 0.01));
    }
}

int main()
{
    // Set camera parameters
    int capture_width = 1640;
    int capture_height = 1232;
    int display_width = 1640;
    int display_height = 1232;
    int framerate = 30;
    int flip_method = 2;
    int sensor_mode = 3;

    std::string pipeline = gstreamer_pipeline(capture_width, capture_height, display_width, display_height, framerate, flip_method, sensor_mode);
    std::cout << "Using pipeline: \n\t" << pipeline << "\n";

    // Load the camera matrix and distortion coefficients from the XML file.
    cv::Mat cameraMatrix, distCoeffs;
    if (!loadCameraCalibration("calibration_params.xml", cameraMatrix, distCoeffs))
    {
        std::cout << "Failed to load camera calibration parameters." << std::endl;
        return -1;
    }

    // Set the marker length.
    float markerLength = 0.10;

    // Create a 4x1 coordinate system for the marker.
    cv::Mat objPoints(4, 1, CV_32FC3);
    objPoints.ptr<cv::Vec3f>(0)[0] = cv::Vec3f(-markerLength / 2.f, markerLength / 2.f, 0);
    objPoints.ptr<cv::Vec3f>(0)[1] = cv::Vec3f(markerLength / 2.f, markerLength / 2.f, 0);
    objPoints.ptr<cv::Vec3f>(0)[2] = cv::Vec3f(markerLength / 2.f, -markerLength / 2.f, 0);
    objPoints.ptr<cv::Vec3f>(0)[3] = cv::Vec3f(-markerLength / 2.f, -markerLength / 2.f, 0);

    // Create a detector and dictionary.
    cv::aruco::DetectorParameters detectorParams = cv::aruco::DetectorParameters();
    cv::aruco::Dictionary dictionary = cv::aruco::getPredefinedDictionary(cv::aruco::DICT_4X4_50);
    cv::aruco::ArucoDetector detector(dictionary, detectorParams);

    // Create a VideoCapture object to grab frames from the camera.
    cv::VideoCapture cap(pipeline, cv::CAP_GSTREAMER);
    if (!cap.isOpened())
    {
        std::cout << "Failed to open camera." << std::endl;
        return -1;
    }

    double total_time = 0;
    int count_loop = 0;

    // Main loop to grab frames and detect markers.
    while (true)
    {
        count_loop += 1;
        // Start timer
        double start_time = static_cast<double>(cv::getTickCount());

        cv::Mat image;
        if (!cap.read(image))
        {
            std::cout << "Capture read error" << std::endl;
            break;
        }

        // Detect the markers in the image.
        std::vector<int> ids;
        std::vector<std::vector<cv::Point2f>> corners;
        detectCorners(image, detector, ids, corners);

        // If at least one marker was detected, draw it and its pose.
        if (ids.size() > 0)
        {
            cv::aruco::drawDetectedMarkers(image, corners, ids);

            // Initialize rvecs and tvecs.
            std::vector<cv::Vec3d> rvecs(ids.size());
            std::vector<cv::Vec3d> tvecs(ids.size());

            // Calculate the pose of each marker.
            for (int i = 0; i < ids.size(); i++)
            {
                cv::solvePnP(objPoints, corners[i], cameraMatrix, distCoeffs, rvecs[i], tvecs[i]);
                // One thing to note is that solvePnP returns the world coordinates relative to the camera
            }
            for (int j = 0; j < ids.size(); j++)
            {
                // Check if the current marker has ID 13
                if (ids[j] == 13)
                {
                    // Extract x, y, z from the translation vector (tvec).
                    double x = tvecs[j][0];
                    double y = tvecs[j][1];
                    double z = tvecs[j][2];

                    // Calculate yaw, pitch, and roll from the rotation vector (rvec).
                    cv::Mat rotMat;
                    cv::Rodrigues(rvecs[j], rotMat);

                    double yaw, pitch, roll;

                    // https://stackoverflow.com/questions/11514063/extract-yaw-pitch-and-roll-from-a-rotationmatrix

                    /*
                    Since OpenCV uses the pinhole camera model (z looking out of the camera,
                    x to the right, y down), yaw, pitch and roll aren't the same. Here, yaw will be the
                    rotation around the y axis, pitch the rotation around the x axis and roll the
                    rotation around the z axis.

                    Normally, we would calculate like this:

                    // Calculate yaw (around z-axis).
                    yaw = atan2(rotMat.at<double>(1, 0), rotMat.at<double>(0, 0));

                    // Calculate pitch (around y-axis).
                    pitch = atan2(-rotMat.at<double>(2, 0), std::sqrt(rotMat.at<double>(2, 1) * rotMat.at<double>(2, 1) + rotMat.at<double>(2, 2) * rotMat.at<double>(2, 2)));

                    // Calculate roll (around x-axis).
                    roll = atan2(-rotMat.at<double>(2, 1), rotMat.at<double>(2, 2));

                    But we must change some of these to be coherent with the pinhole model:
                    roll becomes pitch, pitch becomes yaw and yaw becomes roll
                    */

                    // Using pinhole camera model:
                    // Calculate roll (around z-axis).
                    roll = atan2(rotMat.at<double>(1, 0), rotMat.at<double>(0, 0));

                    // Calculate yaw (around y-axis).
                    yaw = atan2(-rotMat.at<double>(2, 0), std::sqrt(rotMat.at<double>(2, 1) * rotMat.at<double>(2, 1) + rotMat.at<double>(2, 2) * rotMat.at<double>(2, 2)));

                    // Calculate pitch (around x-axis).
                    pitch = atan2(-rotMat.at<double>(2, 1), rotMat.at<double>(2, 2));

                    // Convert yaw, pitch, and roll to degrees.
                    double yaw_degrees = yaw * (180.0 / CV_PI);
                    double pitch_degrees = pitch * (180.0 / CV_PI);
                    double roll_degrees = roll * (180.0 / CV_PI);

                    //Print the measurements             
                    std::cout << "x: " << x << std::endl;
                    std::cout << "y: " << y << std::endl;
                    std::cout << "z: " << z << std::endl;
                    std::cout << "yaw: " << yaw_degrees << std::endl;
                    std::cout << "roll: " << roll_degrees << std::endl;
                    std::cout << "pitch: " << pitch_degrees << std::endl;

                   
                }
            }
            
           
            // Draw the pose of each marker.
            // Since I am no longer showing the image, this part of the code is no longer needed, so I commented it out.
            /*
            for (int i = 0; i < ids.size(); i++)
            {
                cv::drawFrameAxes(image, cameraMatrix, distCoeffs, rvecs[i], tvecs[i], 0.1);
            }
            */
        }

        // Show the image with detected markers.
        // cv::imshow("Image", image);

        // Calculate elapsed time in seconds
        double elapsed_time = (static_cast<double>(cv::getTickCount()) - start_time) / cv::getTickFrequency();
        std::cout << "Elapsed Time: " << elapsed_time << " seconds" << std::endl;
        total_time += elapsed_time;

        // Break the loop if 'q' key is pressed.
        if ((cv::waitKey(1) == 'q') || (total_time >= 30))
        {
            break;
        }
    }

    //Calculate average time for each iteration of the loop
    std::cout << "Average time= " << total_time / count_loop << std::endl;

    cap.release();
    cv::destroyAllWindows();
    return 0;
}

EDIT: The problem doesn't seem to come from the aruco logic itself. Without the aruco logic, the delay is ever so slightly smaller (for example, when I set the time in cv::waitKey() to 1 ms, the delay is ~0.5 s, maybe a bit less). I can tell the problem is still present without the aruco logic because when I change the time in cv::waitKey() to 1000 ms, the same ~12 second delay is there (in this example, if removing the aruco logic made a difference, I can't notice it). Here is the updated code:

#include <opencv2/opencv.hpp>

#include <iostream>
#include <string>

std::string gstreamer_pipeline(int capture_width, int capture_height, int display_width, int display_height, int framerate, int flip_method, int sensor_mode)
{
    return "nvarguscamerasrc sensor_mode=" + std::to_string(sensor_mode) + " ! video/x-raw(memory:NVMM), width=(int)" + std::to_string(capture_width) + ", height=(int)" +
           std::to_string(capture_height) + ", framerate=(fraction)" + std::to_string(framerate) +
           "/1 ! nvvidconv flip-method=" + std::to_string(flip_method) + " ! video/x-raw, width=(int)" + std::to_string(display_width) + ", height=(int)" +
           std::to_string(display_height) + ", format=(string)GRAY8 ! videoconvert ! video/x-raw, format=(string)GRAY8 ! appsink";
}

int main()
{
    // Set camera parameters
    int capture_width = 1640;
    int capture_height = 1232;
    int framerate = 30;
    int flip_method = 2;
    int sensor_mode = 3;

    std::string pipeline = gstreamer_pipeline(capture_width, capture_height, capture_width, capture_height, framerate, flip_method, sensor_mode);
    std::cout << "Using pipeline: \n\t" << pipeline << "\n";

    // Create a VideoCapture object to grab frames from the camera.
    cv::VideoCapture cap(pipeline, cv::CAP_GSTREAMER);
    if (!cap.isOpened())
    {
        std::cout << "Failed to open camera." << std::endl;
        return -1;
    }


    cv::namedWindow("Image", cv::WINDOW_NORMAL);  // Create a window for displaying the image

    // Main loop to grab frames, display them, and check for user input.
    while (true)
    {
        cv::Mat image;
        if (!cap.read(image))
        {
            std::cout << "Capture read error" << std::endl;
            break;
        }

        // Display the captured image
        cv::imshow("Image", image);

        // Check if "q" key is pressed using cv::waitKey
        if (cv::waitKey(1) == 'q')
        {
            break;
        }
    }


    cap.release();
    cv::destroyAllWindows();
    return 0;
}

EDIT 2:

If I set the time in cv::waitKey to 2000 ms, it still takes the same ~12 frames for the movement to show in the displayed image. Maybe there is a buffer somewhere that causes this delay? I tried adding cap.set(cv::CAP_PROP_BUFFERSIZE, 1); below cv::VideoCapture cap(pipeline, cv::CAP_GSTREAMER); but that didn't change anything.
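(To illustrate the kind of buffer I suspect: if stale frames really are queued somewhere, throwing a few of them away with grab() right before each read should make the retrieved frame more recent. This is just an untested sketch, meant to replace the plain cap.read(image) inside the loop above; the count of 5 is an arbitrary guess.)

// Discard a handful of possibly stale frames before reading the one we process.
for (int i = 0; i < 5; i++)
{
    cap.grab();
}

cv::Mat image;
if (!cap.read(image))
{
    std::cout << "Capture read error" << std::endl;
    break;
}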


Solution

I followed @pptaszni's (https://stackoverflow.com/users/4165552/pptaszni) suggestion of changing the pipeline so that the appsink drops frames, and the delay has been reduced to a completely acceptable level, which is what I would expect. I don't really have a reliable way of measuring it, but if I had to guess it is probably around ~200 ms with all of the aruco logic included. Here is the updated pipeline function:

    std::string gstreamer_pipeline(int capture_width, int capture_height, int display_width, int display_height, int framerate, int flip_method, int sensor_mode)
    {
        return "nvarguscamerasrc sensor_mode=" + std::to_string(sensor_mode) + " ! video/x-raw(memory:NVMM), width=(int)" + std::to_string(capture_width) + ", height=(int)" +
               std::to_string(capture_height) + ", framerate=(fraction)" + std::to_string(framerate) +
               "/1 ! nvvidconv flip-method=" + std::to_string(flip_method) + " ! video/x-raw, width=(int)" + std::to_string(display_width) + ", height=(int)" +
               std::to_string(display_height) + ", format=(string)GRAY8 ! videoconvert ! video/x-raw, format=(string)GRAY8 ! appsink drop=true sync=false";
    }
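
For reference, appsink's drop=true discards the oldest queued buffer instead of blocking the upstream elements when the application falls behind, and sync=false (a property inherited from GstBaseSink) stops the sink from waiting on buffer timestamps against the pipeline clock, so frames are handed to OpenCV as soon as they are produced. As an extra note of my own (not part of the answer I followed), appsink also exposes a max-buffers property that limits how many frames can queue up at all, so an even more aggressive sink tail would look like:

    "... ! appsink drop=true sync=false max-buffers=1"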
    

Thanks everyone!