c++ opencv cluster-computing sungridengine

Grid engine cluster + OpenCV: strange behaviour


I'm using a Grid Engine cluster to run some OpenCV code. The code runs fine when executed locally, but it does not work when submitted to the grid. Here is a minimal example.

In the directory ~/code/ I have a file test.cpp containing the following code:

#include <opencv2/core.hpp>
#include <iterator>
#include <string>
#include <sys/types.h>
#include <sys/stat.h>
using namespace cv;
using namespace std;


int main(int ac, char** av)
{    
    /// Declare an empty matrix (the only OpenCV symbol used)
    Mat M;

    /// Create a subfolder
    string folderName = "sub/";
    mkdir(folderName.c_str(),0777);

    return 0;
}

The code compiles without errors.

When executing locally, i.e.

username@machine:~/code$ ./test

it creates a subfolder, i.e. ~/code/sub, as expected.

For submitting to the grid, I created a job script job.sh in the home directory (i.e. ~/job.sh) containing

cd code/
./test

and then submitted it using

qsub job.sh

Nothing happened: the folder was not created, and I saw no errors.

However, when I removed the line

Mat M;

it did create the folder as expected.

What are the possible reasons for this behaviour? My guess is that the OpenCV shared libraries are not installed on the other machines of the grid, but I'm not sure, and I don't know how to verify it.

Thank you in advance for any suggestions.


Solution

  • The libraries need to be accessible to every execution node in the queue you submit the job to. If the execution nodes have access to a shared location, such as an NFS mount, you can install the libraries there. Otherwise, you need to install the required libraries on each execution node. An additional link regarding SET_LIB_PATH:

    blogs.oracle.com/templedf/entry/inheriting_job_environment

    While this helps point the job to the right location, the libraries still need to be accessible from the execution node.
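
To check whether missing libraries are the cause, one option is to let the job itself report what the dynamic loader can resolve on the node it lands on. The following is only a sketch of a replacement for ~/job.sh, under some assumptions: the execution nodes provide bash, ldd and awk; the helper names missing_libs and prepend_lib_path are made up for this example; and OPENCV_LIB_DIR is a hypothetical variable that you would point at a shared OpenCV install, if you have one. By default Grid Engine writes a job's output to files named after the job, such as job.sh.o<jobid> and job.sh.e<jobid>, so a loader error may be sitting there even though nothing appeared in the terminal.

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -j y
# The two "#$" lines are qsub directives: run the job under bash and
# merge stderr into the job's stdout file.

# ASSUMPTION: OPENCV_LIB_DIR is a hypothetical variable for this sketch.
# Leave it empty, or set it to a directory every execution node can read.
OPENCV_LIB_DIR="${OPENCV_LIB_DIR:-}"
BIN="$HOME/code/test"

# Read `ldd` output on stdin and print the names of the libraries
# that the loader reports as "not found".
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Put a directory at the front of LD_LIBRARY_PATH.
prepend_lib_path() {
    if [ -n "${LD_LIBRARY_PATH:-}" ]; then
        export LD_LIBRARY_PATH="$1:$LD_LIBRARY_PATH"
    else
        export LD_LIBRARY_PATH="$1"
    fi
}

echo "Running on host: $(hostname)"

if [ -n "$OPENCV_LIB_DIR" ]; then
    prepend_lib_path "$OPENCV_LIB_DIR"
fi

if [ ! -x "$BIN" ]; then
    echo "Binary not found or not executable: $BIN"
else
    missing=$(ldd "$BIN" 2>&1 | missing_libs)
    if [ -n "$missing" ]; then
        echo "Libraries not found on this node:"
        echo "$missing"
    else
        # Everything resolves, so run the program from its own directory.
        cd "$HOME/code" && ./test
    fi
fi
```

Submit it with qsub job.sh as before and read the job's output file afterwards. If the OpenCV libraries are listed as not found, the node cannot see them, and they need to be installed there or on a shared mount. qsub's -v option (for example -v OPENCV_LIB_DIR=<path>) can pass the variable to the job without editing the script.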