I've implemented the algorithm with parallel_for, but almost all the work happens inside synchronized sections, so I get no speedup. Is there a better option?
    tbb::parallel_for(tbb::blocked_range<int>(1, m),   // iterate over the rows of column j
                      apply_transform(d, j, this, m, n));

    // dMutex / dLock: a TBB mutex and its scoped lock, declared elsewhere in the class
    void Levenstein::apply_transformation(int *d, int i, int j, int n) {
        int elem1 = (*v1)[i];
        int elem2 = (*v2)[j];
        if (elem1 == elem2) {
            dLock.acquire(dMutex);
            d[i*n + j] = d[(i-1)*n + j-1];                       // no operation required
            dLock.release();
        } else {
            dLock.acquire(dMutex);
            d[i*n + j] = std::min(std::min(d[(i-1)*n + j] + 1,   // deletion
                                           d[i*n + j-1] + 1),    // insertion
                                  d[(i-1)*n + j-1] + 1);         // substitution
            dLock.release();
        }
    }

    class apply_transform {
        int *array;
        int m_j;
        Levenstein *m_l;
        int m, n;
    public:
        apply_transform(int *a, int j, Levenstein *l, int width, int height) :
            array(a), m_j(j), m_l(l), m(width), n(height) {}

        void operator()(const tbb::blocked_range<int> &r) const {
            for (int i = r.begin(); i != r.end(); i++) {
                m_l->apply_transformation(array, i, m_j, n);
            }
        }
    };
The Levenshtein distance calculation is essentially a wavefront pattern, where computing d(i,j) requires knowledge of d(i-1,j-1), d(i-1,j), and d(i,j-1). These dependencies naturally form a DAG of tasks: a task that computes d(i,j) can only be executed when all its predecessors (and, in turn, their predecessors, etc.) are completed. Tasks whose dependencies are all resolved and that don't depend on each other can be processed in parallel; for example, d(1,2) and d(2,1) are independent, and in general all cells on the same anti-diagonal (i + j = const) are mutually independent. Except for following the dependencies, no other synchronization is needed.
There are a few ways to express the wavefront pattern in TBB: with low-level tasks (complex and not recommended), with parallel_do plus atomic counters (as explained in the TBB Design Patterns documentation; the example used there is very close to what you are doing), and with the flow graph (as explained in the TBB blog). I recommend studying these docs and running your own experiments; a minimal sketch of the counter-based variant is below.
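For illustration, here is a minimal, untuned sketch of the parallel_do approach. The names WavefrontBody, levenshtein_wavefront, s1 and s2 are mine, not from your code, and I assume a row-major (m+1) x (n+1) distance matrix. Each cell has an atomic counter of unfinished predecessors; only the left and upper neighbors are counted, since the diagonal dependency is implied by them. A finished cell decrements its successors' counters and feeds any that drop to zero:

    #include <tbb/parallel_do.h>
    #include <tbb/atomic.h>
    #include <algorithm>
    #include <string>
    #include <utility>
    #include <vector>

    typedef std::pair<int, int> Cell;    // (i, j) position in the matrix

    struct WavefrontBody {
        const std::string &s1, &s2;
        int m, n;                        // matrix is (m+1) x (n+1), row-major
        int *d;                          // distance matrix
        tbb::atomic<int> *count;         // unresolved-predecessor counters

        WavefrontBody(const std::string &a, const std::string &b,
                      int *dist, tbb::atomic<int> *cnt)
            : s1(a), s2(b), m((int)a.size()), n((int)b.size()),
              d(dist), count(cnt) {}

        void operator()(const Cell &c, tbb::parallel_do_feeder<Cell> &feeder) const {
            int i = c.first, j = c.second, w = n + 1;
            if (s1[i-1] == s2[j-1])
                d[i*w + j] = d[(i-1)*w + j-1];                       // no operation required
            else
                d[i*w + j] = std::min(std::min(d[(i-1)*w + j] + 1,   // deletion
                                               d[i*w + j-1] + 1),    // insertion
                                      d[(i-1)*w + j-1] + 1);         // substitution
            // Release the successors; the last predecessor to finish feeds the cell.
            if (j < n && --count[i*w + j+1] == 0) feeder.add(Cell(i, j+1));
            if (i < m && --count[(i+1)*w + j] == 0) feeder.add(Cell(i+1, j));
        }
    };

    int levenshtein_wavefront(const std::string &s1, const std::string &s2) {
        int m = (int)s1.size(), n = (int)s2.size(), w = n + 1;
        if (m == 0 || n == 0) return m + n;
        std::vector<int> d((m+1) * w);
        std::vector<tbb::atomic<int> > count((m+1) * w);
        for (int j = 0; j <= n; ++j) d[j] = j;       // first row: j insertions
        for (int i = 0; i <= m; ++i) d[i*w] = i;     // first column: i deletions
        for (int i = 1; i <= m; ++i)
            for (int j = 1; j <= n; ++j)
                count[i*w + j] = (i > 1) + (j > 1);  // border cells wait for fewer
        Cell root(1, 1);                             // the only initially ready cell
        tbb::parallel_do(&root, &root + 1, WavefrontBody(s1, s2, &d[0], &count[0]));
        return d[m*w + n];
    }

Note that there are no locks at all: the counters are the only synchronization, and each cell is written by exactly one task.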
Also note that the amount of work to compute a single d(i,j) is far too small to justify the overhead of parallel execution. To get some speedup, aggregate a rectangular block of cells into a single task and process the block's elements serially within the task; the blocks have the same dependency pattern as individual cells. But the bigger the blocks, the less parallelism is available, so experiment with the block size to find the best performance. A blocked variant of the sketch above follows.
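As a sketch of that aggregation (again with hypothetical names; it reuses the includes and the Cell type from the sketch above, with one counter per block instead of per cell, and B as the block side length you would tune):

    struct BlockedWavefrontBody {
        const std::string &s1, &s2;
        int m, n, B;                     // B: block side length, a tuning knob
        int *d;
        tbb::atomic<int> *count;         // now one counter per block
        int blockCols;                   // number of blocks per block-row

        BlockedWavefrontBody(const std::string &a, const std::string &b, int blk,
                             int *dist, tbb::atomic<int> *cnt)
            : s1(a), s2(b), m((int)a.size()), n((int)b.size()), B(blk),
              d(dist), count(cnt), blockCols((n + B - 1) / B) {}

        void operator()(const Cell &c, tbb::parallel_do_feeder<Cell> &feeder) const {
            int bi = c.first, bj = c.second, w = n + 1;
            // Process the whole block serially; no locks are needed because the
            // counters guarantee the neighboring blocks are already finished.
            for (int i = bi*B + 1; i <= std::min((bi+1)*B, m); ++i)
                for (int j = bj*B + 1; j <= std::min((bj+1)*B, n); ++j)
                    d[i*w + j] = (s1[i-1] == s2[j-1])
                        ? d[(i-1)*w + j-1]
                        : std::min(std::min(d[(i-1)*w + j] + 1,
                                            d[i*w + j-1] + 1),
                                   d[(i-1)*w + j-1] + 1);
            // Release the blocks to the right and below, exactly as for cells.
            int lastBi = (m - 1) / B, lastBj = (n - 1) / B;
            if (bj < lastBj && --count[bi*blockCols + bj+1] == 0) feeder.add(Cell(bi, bj+1));
            if (bi < lastBi && --count[(bi+1)*blockCols + bj] == 0) feeder.add(Cell(bi+1, bj));
        }
    };

    int levenshtein_blocked(const std::string &s1, const std::string &s2, int B) {
        int m = (int)s1.size(), n = (int)s2.size(), w = n + 1;
        if (m == 0 || n == 0) return m + n;
        int blockRows = (m + B - 1) / B, blockCols = (n + B - 1) / B;
        std::vector<int> d((m+1) * w);
        std::vector<tbb::atomic<int> > count(blockRows * blockCols);
        for (int j = 0; j <= n; ++j) d[j] = j;
        for (int i = 0; i <= m; ++i) d[i*w] = i;
        for (int bi = 0; bi < blockRows; ++bi)
            for (int bj = 0; bj < blockCols; ++bj)
                count[bi*blockCols + bj] = (bi > 0) + (bj > 0);
        Cell root(0, 0);                 // the top-left block is ready first
        tbb::parallel_do(&root, &root + 1,
                         BlockedWavefrontBody(s1, s2, B, &d[0], &count[0]));
        return d[m*w + n];
    }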