Search code examples
c++parallel-processingopenmptbb

Parallelizing QuickHull: OpenMP gives small speedup whilst TBB gives negative speedup


I'm a beginner when it comes down to shared-memory programming using OpenMP and TBB.

I'm implementing a parallel version of the QuickHull algorithm (http://en.wikipedia.org/wiki/QuickHull) to find the convex hull of a set of data points. (http://en.wikipedia.org/wiki/Convex_hull).

Basically the following tasks can be done in parallel:

  1. Find the leftmost and rightmost point (P and Q).
  2. Split up the entire dataset according to the line connection these two points (P and Q).
  3. For each of these two sets, get the point that is the furthest from the line on which the last split occurred (PQ).
  4. Split the data into two sets based on the furthest point (C): One containing all elements right of the line PC and one containing all elements right of the line QC.

Note that part 3 and 4 are done recursively until every subset is empty.

First I've done this with OpenMP using mostly the #pragma omp parallel for.... But I personally think I'm doing something wrong as the speedup never passes 2x. Secondly I've also made an implementation using Intel TBB in order to compare the speedups, this resulted in a negative speedup (even for large data sets). Using TBB, I used both tbb::parallel_for() and tbb::parallel_reduce().

Basically my question can be split up into two parts: 1) OpenMP implementation 2) TBB implementation

Part 1

As can be seen on the benchmark below when the size of my data set increases also does the speedup when using sufficient threads.

Runtimes for small sets up to 10,000 elements Runtimes for larger sets starting from 100,000 elements Speed up comparing OpenMP and sequential algorithm

Notice that the speedup does not go past 2x which I personally think is quite bad for this algorithm as a huge portion is parallelizable. Here is the relevant code:

     void ompfindHull(POINT_VECTOR &input, Point* p1, Point* p2, POINT_VECTOR& output){
        // If there are no points in the set... just stop. This is the stopping criteria for the recursion. :)
        if (input.empty() || input.size() == 0) return;

        int num_threads = omp_get_max_threads();
        // Get the point that is the farthest from the p1-p2 segment
        Point** farthest_sub = new Point*[num_threads];
        double** distance_sub = new double*[num_threads];
        int thread_id;
        #pragma omp parallel private (thread_id)
        {
            thread_id = omp_get_thread_num();
            farthest_sub[thread_id] = input[0];
            distance_sub[thread_id] = new double(0);

            #pragma omp for
                for (int index = 1; index < input.size(); index++){
                    Point*a = p1;
                    Point*b = p2;
                    Point*c = input[index];

                    double distance = ( ( b->x - a->x ) * ( a->y - c->y ) ) - ( ( b->y - a->y ) * ( a->x - c->x ) );
                    distance = distance >= 0 ? distance : -distance;

                    double cur_distance = *distance_sub[thread_id];
                    if (cur_distance < distance){
                        farthest_sub[thread_id] = input[index];
                        distance_sub[thread_id] = new double(distance);
                    }
                }
        }

        Point* farthestPoint = farthest_sub[0];
        int distance = *distance_sub[0];
        for (int index = 1; index < num_threads; index++){
            if (distance < *distance_sub[index]){
                farthestPoint = farthest_sub[index];
            }
        }

        delete [] farthest_sub;
        delete [] distance_sub;

        // Add the farthest point to the output as it is part of the convex hull.
        output.push_back(farthestPoint);

        // Split in two sets.
        // The first one contains points right from p1 - farthestPoint
        // The second one contains points right from farthestPoint - p2
        vector<POINT_VECTOR> left_sub(num_threads), right_sub(num_threads);
        #pragma omp parallel private(thread_id)
        {
            thread_id = omp_get_thread_num();
            #pragma omp for
            for (size_t index = 0; index < input.size(); index++){
                Point* curPoint = input[index];
                if (curPoint != farthestPoint){
                    if (getPosition(p1, farthestPoint, curPoint) == RIGHT){
                        left_sub[thread_id].push_back(curPoint);
                    } else if (getPosition(farthestPoint, p2, curPoint) == RIGHT){
                        right_sub[thread_id].push_back(curPoint);
                    }
                }
            }
        }

        //Merge all vectors into a single vector :)
        POINT_VECTOR left, right;
        for (int index=0; index < num_threads; index++){
            left.insert(left.end(), left_sub[index].begin(), left_sub[index].end());
            right.insert(right.end(), right_sub[index].begin(), right_sub[index].end());
        }

        input.clear();


        // We do more recursion :)
        ompfindHull(left, p1, farthestPoint, output);
        ompfindHull(right, farthestPoint, p2, output);
     }

     double ompquickHull(POINT_VECTOR input, POINT_VECTOR& output){
        Timer timer;
        timer.start();

        // Find the left- and rightmost point.
        // We get the number of available threads.
        int num_threads = omp_get_max_threads();
        int thread_id;
        POINT_VECTOR minXPoints(num_threads);
        POINT_VECTOR maxXPoints(num_threads);

        // Devide all the points in subsets between several threads. For each of these subsets
        // we need to find the minX and maxX
        #pragma omp parallel shared(minXPoints,maxXPoints, input) private(thread_id)
        {
            thread_id = omp_get_thread_num();
            minXPoints[thread_id] = input[0];
            maxXPoints[thread_id] = input[0];

            int index;
            #pragma omp for
            for (index = 1; index < input.size(); index++)
            {
                Point* curPoint = input[index];
                if (curPoint->x > maxXPoints[thread_id]->x){
                    maxXPoints[thread_id] = curPoint;
                } else if (curPoint->x < minXPoints[thread_id]->x) {
                    minXPoints[thread_id] = curPoint;
                }
            }

            #pragma omp barrier

        }

        // We now have all the minX and maxX points of every single subset. We now use
        // these values to find the overall min and max X-point.
        Point* minXPoint = input[0], *maxXPoint = input[0];
        for (int index = 0; index < num_threads; index++){
            if (minXPoint->x > minXPoints[index]->x){
                minXPoint = minXPoints[index];
            }

            if (maxXPoint->x < maxXPoints[index]->x){
                maxXPoint = maxXPoints[index];
            }
        }

        // These points are sure to be part of the convex hull, so add them
        output.push_back(minXPoint);
        output.push_back(maxXPoint);

        // Now we have to split the set of point in subsets.
        // The first one containing all points above the line
        // The second one containing all points below the line
        const int size = input.size();
        vector<POINT_VECTOR> left_sub(num_threads), right_sub(num_threads);

        #pragma omp parallel private(thread_id)
        {
            thread_id = omp_get_thread_num();
            #pragma omp for
            for (unsigned int index = 0; index < input.size(); index++){
                Point* curPoint = input[index];
                if (curPoint != minXPoint || curPoint != maxXPoint){
                    if (getPosition(minXPoint, maxXPoint, curPoint) == RIGHT){
                        left_sub[thread_id].push_back(curPoint);
                    }
                    else if (getPosition(maxXPoint, minXPoint, curPoint) == RIGHT){
                        right_sub[thread_id].push_back(curPoint);
                    }
                }
            }
        }

        //Merge all vectors into a single vector :)
        POINT_VECTOR left, right;
        for (int index=0; index < num_threads; index++){
            left.insert(left.end(), left_sub[index].begin(), left_sub[index].end());
            right.insert(right.end(), right_sub[index].begin(), right_sub[index].end());
        }

        // We now have the initial two points belonging to the hill
        // We also split all the points into a group containing points left of AB and a group containing points right of of AB
        // We now recursively find all other points belonging to the convex hull.

        ompfindHull(left,minXPoint, maxXPoint, output);
        ompfindHull(right, maxXPoint, minXPoint, output);

        timer.end();

        return timer.getTimeElapsed();
     }

Does anyone know whether or not is normal to achieve only a 2x speedup using 8 cores whilst such a large portion of the code is parallelizable? If not, what am I doing wrong here !?

Part 2

Now comes the real problem...

Running the same tests on the TBB implementation gives the following result: Runtime for small datasets Runtime for large datasets Speedup graph

As can be seen the runtimes of the parallel implementation always exceed the runtime of the sequential one. As for the speedup graph speedups are less than one, meaning it is a negative speedup!

Here is the code of the different structs I've created:

Note that typedef tbb::concurrent_vector<Point*> CPOINT_VECTOR

EDIT: Applied Arch's comments.

    class FindExtremum{
public:
    enum ExtremumType{
        MINIMUM,MAXIMUM
    };

public:
    FindExtremum(CPOINT_VECTOR& points):fPoints(points), fMinPoint(points[0]), fMaxPoint(points[0]){}
    FindExtremum(const FindExtremum& extremum, tbb::split):fPoints(extremum.fPoints), fMinPoint(extremum.fMinPoint), fMaxPoint(extremum.fMaxPoint){}

    void join(const FindExtremum& other){
        Point* curMinPoint = other.fMinPoint;
        Point* curMaxPoint = other.fMaxPoint;

        if (isLargerThan(curMinPoint, MINIMUM)){
            fMinPoint = curMinPoint;
        }

        if (isSmallerThan(curMaxPoint, MAXIMUM)){
            fMaxPoint = curMaxPoint;
        }
    }

    void operator()(const BLOCKED_RANGE& range){
        for (size_t index = range.begin(); index < range.end(); index++){
            Point* curPoint = fPoints[index];

            if (isLargerThan(curPoint, MINIMUM)){
                fMinPoint = curPoint;
            }

            if (isSmallerThan(curPoint, MAXIMUM)){
                fMaxPoint = curPoint;
            }
        }
    }

private:
    bool isSmallerThan(const Point* point, const ExtremumType& type){
        switch (type){
        case MINIMUM:
            return fMinPoint->x < point->x;
        case MAXIMUM:
            return fMaxPoint->x < point->x;
        }
    }

    bool isLargerThan(const Point* point, const ExtremumType& type){
        return !isSmallerThan(point, type);
    }

public:
    Point* getMaxPoint(){
        return this->fMaxPoint;
    }

    Point* getMinPoint(){
        return this->fMinPoint;
    }

public:
    CPOINT_VECTOR fPoints;
    Point* fMinPoint;
    Point* fMaxPoint;

};

class Splitter{
public:
    Splitter(const CPOINT_VECTOR& points, Point* point1, Point* point2,
            Point* farthestPoint, CPOINT_VECTOR* left, CPOINT_VECTOR* right, int grainsize):
        fPoints(points), p1(point1), p2(point2), farthestPoint(farthestPoint), fLeft(left), fRight(right), fGrain(grainsize)
    {
        //fLeft = new tbb::concurrent_vector<Point*>();
        //fRight = new tbb::concurrent_vector<Point*>();
        //fLeft = new vector<Point*>();
        //fRight = new vector<Point*>();
    };

    Splitter(const Splitter& splitter, tbb::split):
        fPoints(splitter.fPoints), p1(splitter.p1), p2(splitter.p2), farthestPoint(splitter.farthestPoint),
        fLeft(splitter.fLeft), fRight(splitter.fRight), fGrain(splitter.fGrain){}

    void operator()(const BLOCKED_RANGE& range) const{
        const int grainsize = fGrain;
        Point** left = new Point*[grainsize];
        Point** right = new Point*[grainsize];
        int leftcounter = 0;
        int rightcounter = 0;
        for (size_t index = range.begin(); index < range.end(); index++){
            Point* curPoint = fPoints[index];
            if (curPoint != farthestPoint){
                if (getPosition(p1, farthestPoint, curPoint) == RIGHT){
                    left[leftcounter++] = curPoint;
                } else if (getPosition(farthestPoint, p2, curPoint) == RIGHT){
                    right[rightcounter++] = curPoint;
                }
            }
        }
        appendVector(left,leftcounter,*fLeft);
        appendVector(right,rightcounter,*fRight);
    }

public:
    Point* p1;
    Point* p2;
    Point* farthestPoint;
    int fGrain;
    CPOINT_VECTOR* fLeft;
    CPOINT_VECTOR* fRight;
    CPOINT_VECTOR fPoints;

};

class InitialSplitter{
public:
    InitialSplitter(const CPOINT_VECTOR& points, CPOINT_VECTOR* left, CPOINT_VECTOR* right,
                    Point* point1, Point* point2, int grainsize):
            fPoints(points), p1(point1), p2(point2), fLeft(left), fRight(right), fGrain(grainsize){}

    InitialSplitter(const InitialSplitter& splitter, tbb::split):
        fPoints(splitter.fPoints), p1(splitter.p1), p2(splitter.p2),
        fLeft(splitter.fLeft), fRight(splitter.fRight), fGrain(splitter.fGrain){
    }

    void operator()(const BLOCKED_RANGE& range) const{
        const int grainsize = fGrain;
        Point** left = new Point*[grainsize];
        Point** right = new Point*[grainsize];
        int leftcounter = 0;
        int rightcounter = 0;
        for (size_t index = range.begin(); index < range.end(); index++){
            Point* curPoint = fPoints[index];
            if (curPoint != p1 || curPoint != p2){
                if (getPosition(p1, p2, curPoint) == RIGHT){
                    left[leftcounter++] = curPoint;
                } else if (getPosition(p2, p1, curPoint) == RIGHT){
                    right[rightcounter++] = curPoint;
                }
            }
        }
        appendVector(left,leftcounter,*fLeft);
        appendVector(right,rightcounter,*fRight);
    }

public:
    CPOINT_VECTOR fPoints;
    int fGrain;
    Point* p1;
    Point* p2;
    CPOINT_VECTOR* fLeft;
    CPOINT_VECTOR* fRight;
};

class FarthestPointFinder{
public:
    FarthestPointFinder(const CPOINT_VECTOR& points, Point* p1, Point* p2):
        fPoints(points), fFarthestPoint(points[0]),fDistance(-1), p1(p1), p2(p2){}

    FarthestPointFinder(const FarthestPointFinder& fpf, tbb::split):
        fPoints(fpf.fPoints), fFarthestPoint(fpf.fFarthestPoint),fDistance(-1), p1(fpf.p1), p2(fpf.p2){}

    void operator()(const BLOCKED_RANGE& range){
        for (size_t index = range.begin(); index < range.end(); index++){
            Point* curPoint = fPoints[index];
            double curDistance = distance(p1,p2,curPoint);
            if (curDistance > fDistance){
                fFarthestPoint = curPoint;
                fDistance = curDistance;
            }
        }
    }

    void join(const FarthestPointFinder& other){
        if (fDistance < other.fDistance){
            fFarthestPoint = other.fFarthestPoint;
            fDistance = other.fDistance;
        }
    }

public:
    Point* getFarthestPoint(){
        return this->fFarthestPoint;
    }

public:
    CPOINT_VECTOR fPoints;
    Point* fFarthestPoint;
    int fDistance;
    Point* p1;
    Point* p2;
};

Followed by the QuickHull code:

   void tbbfindHull(CPOINT_VECTOR &input, Point* p1, Point* p2, POINT_VECTOR& output, int max_threads){
    // If there are no points in the set... just stop. This is the stopping criteria for the recursion. :)
    if (input.empty() || input.size() == 0) return;
    else if (input.size() == 1) {
        output.push_back(input[0]);
        return;
    }

    // Get the point that is the farthest from the p1-p2 segment

    int GRAINSIZE = ((double)input.size())/max_threads;

    FarthestPointFinder fpf(input, p1, p2);
    tbb::parallel_reduce(BLOCKED_RANGE(0,input.size(),GRAINSIZE), fpf);
    Point *farthestPoint = fpf.getFarthestPoint();

    // Add the farthest point to the output as it is part of the convex hull.
    output.push_back(farthestPoint);

    // Split in two sets.
    // The first one contains points right from p1 - farthestPoint
    // The second one contains points right from farthestPoint - p2

    CPOINT_VECTOR* left = new CPOINT_VECTOR();
    CPOINT_VECTOR* right = new CPOINT_VECTOR();

    Splitter splitter(input,p1,p2,farthestPoint, left, right, GRAINSIZE);
    tbb::parallel_for(BLOCKED_RANGE(0,input.size(), GRAINSIZE), splitter);

    // We do more recursion :)
    tbbfindHull(*left, p1, farthestPoint, output, max_threads);
    tbbfindHull(*right, farthestPoint, p2, output, max_threads);

}

/**
 * Calling the quickHull algorithm!
 */
double tbbquickHull(POINT_VECTOR input_o, POINT_VECTOR& output, int max_threads){

    CPOINT_VECTOR input;
    for (int i =0; i < input_o.size(); i++){
        input.push_back(input_o[i]);
    }

    int GRAINSIZE = input.size()/max_threads;
    Timer timer;
    timer.start();

    // Find the left- and rightmost point.
    FindExtremum fextremum(input);
    tbb::parallel_reduce(BLOCKED_RANGE(0, input.size(),GRAINSIZE), fextremum);

    Point* minXPoint = fextremum.getMinPoint();
    Point* maxXPoint = fextremum.getMaxPoint();

    // These points are sure to be part of the convex hull, so add them
    output.push_back(minXPoint);
    output.push_back(maxXPoint);

    // Now we have to split the set of point in subsets.
    // The first one containing all points above the line
    // The second one containing all points below the line
    CPOINT_VECTOR* left = new CPOINT_VECTOR;
    CPOINT_VECTOR* right = new CPOINT_VECTOR;

    //Timer temp1;
    //temp1.start();
    InitialSplitter splitter(input, left, right, minXPoint, maxXPoint, GRAINSIZE);
    tbb::parallel_for(BLOCKED_RANGE(0, input.size(),GRAINSIZE), splitter);
    // We now have the initial two points belonging to the hill
    // We also split all the points into a group containing points left of AB and a group containing points right of of AB
    // We now recursively find all other points belonging to the convex hull.

    tbbfindHull(*left,minXPoint, maxXPoint, output, max_threads);
    tbbfindHull(*right, maxXPoint, minXPoint, output, max_threads);
    timer.end();

    return timer.getTimeElapsed();
}

In TBB, when timing the different parallel portions of the code, some anomalies could be noticed. The initial division of the total subset InitialSplitter into two subsets using tbb::parallel_for() takes almost as long as the entire runtime of the corresponding OpenMP version and this time does not vary when a different number of threads are used. This is strange as a significant speedup can be noticed when timing inside the InitialSplitter-object that is passed on as argument to tbb::parallel_for(). The for-loop which is iterated in the InitialSplitters' operator()-method shows a speedup that is expected when the number of threads are increased.

I think it is very strange that a single tbb::parallel_for() - such as for example the one in the initialization taking an InitialSplitter instance - takes as long as the entire OpenMP implementation to run. I also think it is quite strange that when timing around the tbb::parallel_for() no speedup can be observed, whilst timing inside the InitialSplitters-operator() a near-linear speedup can be observed...

Is there anyone out here who can help me out !?

Thanks in advance!


Solution

  • I have a few comments that you might find useful.

    In general I try to avoid accessing the thread number directly (thread_id). Instead of defining arrays with a size equal to the number of threads you should define your variables in the parallel block (this automatically makes them private). Then instead of looping over the number of threads after the parallel block you should use an atomic, critical, or single approach (I'm not sure which is best here). Something like this.

    Point* farthestPoint;
    //int distance = *distance_sub[0];  //is this a bug shouldn't distance be a double?
    double distance = 0
    #pragma omp parallel private
    {
        Point farthest_sub = input[0];
        double distance_sub = 0;
         #pragma omp for nowait
         for (int index = 1; index < input.size(); index++){
         // for loop code
         }
         #pragma omp critical 
         {        
             if (distance < distance_sub){
                farthestPoint = farthest_sub;
             }
         }
         #pragma omp barrier 
         //next part of code 
    }
    

    One problem you might have now is false sharing. Each thread is trying to write to arrays in the same cache line (e.g. in the array distance_sub[num_threads]). I'm not sure what OpenMP does when you declare the values inside the the parallel block but I suspect that it's more likely to allocate the values to avoid false sharing.

    Another comment you, should try and avoid calling OpenMP so many times, especially on a small number of elements. There is an overhead with OpenMP. I would try and get as much in one parallel block as possible using a barrier or whatever.

    Also, in your code you have int distance = *distance_sub[0] is this a bug? Shouldn't distance be a double?

    Lastly, a pedantic point. I doubt you have 8 cores. You probably have 4 cores and 8 hardware threads due to Intel hyper threading. The distinction can be important.