Tags: c++, multithreading, openmp, sparse-matrix, slowdown

OpenMP slows down when going from 2 to 4 threads doing binary searches in a custom container


I'm currently having a problem parallelizing a C++ program with OpenMP. I am implementing a recommendation system with a user-based collaborative filtering method. To do that, I implemented a sparse_matrix class as a dictionary of dictionaries (in the sense of a Python dict). In my case, insertion is only done at the beginning of the algorithm, when data is read from file, so I implemented each dictionary as a standard-library vector of (key, value) pair objects together with a flag that indicates whether the vector is sorted. If the vector is sorted, a key is looked up with a binary search; otherwise, the vector is first sorted and then searched. Alternatively, the dictionary's entries can be scanned linearly, for example in loops over all the keys of the dictionary. The relevant portion of the code that is causing problems is the following:
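For context, a minimal sketch of a dictionary like the one described above: a vector of (key, value) pairs with a sorted flag, sorting lazily on first lookup. The class and method names here are my own illustration, not the original code.

```cpp
#include <algorithm>
#include <stdexcept>
#include <utility>
#include <vector>

// Illustrative sketch: sorted-vector dictionary with lazy sorting.
template <typename K, typename V>
class dictionary {
public:
    void insert(const K &key, const V &value) {
        entries_.emplace_back(key, value);
        sorted_ = false;  // insertion invalidates the ordering
    }

    // Lookup by binary search; sorts lazily on the first search.
    // Throws std::out_of_range when the key is absent.
    V &operator[](const K &key) {
        if (!sorted_) {
            std::sort(entries_.begin(), entries_.end(),
                      [](const std::pair<K, V> &a, const std::pair<K, V> &b) {
                          return a.first < b.first;
                      });
            sorted_ = true;
        }
        auto it = std::lower_bound(entries_.begin(), entries_.end(), key,
                                   [](const std::pair<K, V> &e, const K &k) {
                                       return e.first < k;
                                   });
        if (it == entries_.end() || it->first != key)
            throw std::out_of_range("key not found");
        return it->second;
    }

    std::size_t size() const { return entries_.size(); }

private:
    std::vector<std::pair<K, V>> entries_;
    bool sorted_ = true;
};
```

Note that the lazy sort inside the lookup is a write; in a parallel region it would be a data race unless, as in the question, all dictionaries are sorted before the threads start.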

void compute_predicted_ratings_omp (sparse_matrix &targets,
                                    sparse_matrix &user_item_rating_matrix,
                                    sparse_matrix &similarity_matrix,
                                    int k_neighbors)
{
    // Auxiliary private variables
    int user, item;
    double predicted_rating;
    dictionary<int,double> target_vector, item_rating_vector, item_similarity_vector;

    #pragma omp parallel shared(targets, user_item_rating_matrix, similarity_matrix)\
                         private(user, item, predicted_rating, target_vector, item_rating_vector, item_similarity_vector)
    {
        if (omp_get_thread_num() == 0)
        std::cout << " - parallelized on " << omp_get_num_threads() << " threads: " << std::endl;

        #pragma omp for schedule(dynamic, 1)
        for (size_t iter_row = 0; iter_row < targets.nb_of_rows(); ++iter_row)
        {
            // Retrieve target user
            user = targets.row(iter_row).get_key();

            // Retrieve the user rating vector.
            item_rating_vector = user_item_rating_matrix[user];

            for (size_t iter_col = 0; iter_col < targets.row(iter_row).value().size(); ++iter_col)
            {
                // Retrieve target item
                item = targets.row(iter_row).value().entry(iter_col).get_key();

                // retrieve similarity vector associated to the target item
                item_similarity_vector = similarity_matrix[item];

                // Compute predicted rating
                predicted_rating = predict_rating(item_rating_vector,
                                                  item_similarity_vector,
                                                  k_neighbors);

                // Set result in targets
                targets.row(iter_row).value().entry(iter_col).set_value(predicted_rating);
            }
        }
    }
}

In this function I compute the predicted rating for a series of target pairs (user, item) (this is simply a weighted average). To do that, I do an outer loop on the target users (which are on the rows of the targets sparse matrix) and I retrieve the rating vector for the current user performing a binary search on the rows of the user_item_rating_matrix. Then, for each column in the current row (i.e. for each item) I retrieve another vector associated to the current item from the sparse matrix similarity_matrix. With these two vectors, I compute the prediction as a weighted average of their elements (on a subset of the items in common between the two vectors).
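In symbols, the weighted average described above (my notation, not from the original post) is:

```latex
\hat{r}_{u,i} \;=\; \frac{\sum_{j} s_{i,j}\, r_{u,j}}{\sum_{j} s_{i,j}},
\qquad \text{summing over items } j \text{ with } s_{i,j} > 0 \text{ and } r_{u,j} > 0,
```

where $s_{i,j}$ is the similarity between items $i$ and $j$ and $r_{u,j}$ is the rating user $u$ gave to item $j$; this matches the `sum_sr / sum_s` computation in the code below.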

My problem is the following: I want to parallelize the outer loop using OpenMP. In the serial version, this function takes around 3 seconds. With OpenMP on 2 threads, it takes around 2 seconds (which is not bad, since I still have some work imbalance in the outer loop). When using 4 threads, it takes 7 seconds. I cannot understand the cause of this slowdown. Do you have any idea?

I have already thought about the problem and I share my considerations with you:

  • I access the sparse matrices only in read mode. Since the matrices are pre-sorted, the binary searches do not modify them, so no race conditions should arise.
  • Several threads could access the same vector of a sparse matrix at the same time. I read something about false sharing, but since I never write to these vectors I don't think this is the reason for the slowdown.
  • The parallel version seems to work fine with two threads (even if the speedup is lower than expected).
  • No problem is observed with 4 threads for other choices of the parameters. In particular (cf. "Further details on predict_rating function" below), when I consider all the similar items for the weighted average and I scan the rating vector and search in the similarity vector (the opposite of what I normally do), the execution time scales well on 4 threads.

Further details on the predict_rating function: it works in the following way. The shorter of item_rating_vector and item_similarity_vector is scanned linearly, and a binary search is done on the longer of the two. If the rating/similarity is positive, it is considered in the weighted average.

double predict_rating (dictionary<int, double> &item_rating_vector,
                       dictionary<int, double> &item_similarity_vector,
                       int k_neighbors)  // unused in this excerpt, which averages over all similar items
{
    size_t size_item_rating_vector = item_rating_vector.size();
    size_t size_item_similarity_vector = item_similarity_vector.size();

    if (size_item_rating_vector == 0 || size_item_similarity_vector == 0)
        return 0.0;
    else
    {
        double s, r, sum_s = 0.0, sum_sr = 0.0;
        int temp_item = 0;

        if (size_item_rating_vector < size_item_similarity_vector)
        {
            // Scan item_rating_vector and search in item_similarity_vector
            for (dictionary<int,double>::const_iterator iter = item_rating_vector.begin();
                 iter != item_rating_vector.end();
                 ++iter)
            {
                temp_item = (*iter).get_key();

                // Retrieve the similarity between temp_item and the target item
                // (0.0 if absent)
                try { s = item_similarity_vector[temp_item]; }
                catch (const std::out_of_range &e) { s = 0.0; }

                if (s > 0.0)
                {
                    // temp_item is positively similar to the target item:
                    // consider it in the average

                    // Retrieve the rating that the user gave to temp_item
                    r = (*iter).get_value();

                    // Increment the sums
                    sum_s += s;
                    sum_sr += s * r;
                }
            }
        }
        else
        {
            // Scan item_similarity_vector and search in item_rating_vector
            for (dictionary<int,double>::const_iterator iter = item_similarity_vector.begin();
                 iter != item_similarity_vector.end();
                 ++iter)
            {
                temp_item = (*iter).get_key();

                s = (*iter).get_value();

                if (!(s > 0.0))
                    continue;

                // Retrieve the rating that the user gave to temp_item
                // (0.0 if not given)
                try { r = item_rating_vector[temp_item]; }
                catch (const std::out_of_range &e) { r = 0.0; }

                if (r > 0.0)
                {
                    // temp_item was rated by the user: increment the sums
                    sum_s += s;
                    sum_sr += s * r;
                }
            }
        }

        if (sum_s > 0.0)
            return sum_sr / sum_s;
        else
            return 0.0;
    }
}

Further details on the hardware: I am running this program on a Dell XPS 15 with a quad-core i7 processor and 16 GB of RAM. I execute the code in a Linux VirtualBox VM (configured with 4 processors and 4 GB of RAM).

Thanks in advance,

Pierpaolo


Solution

  • Well, I found the solution to the problem myself; I post the explanation for the community. In the predict_rating function I used try/catch to handle the out_of_range errors thrown by my dictionary structure when a key not contained in the dictionary is looked up. I read in Are exceptions in C++ really slow that exception handling is computationally heavy when an exception is actually thrown. In my case, every call of predict_rating had multiple out_of_range errors thrown and handled. I simply removed the try/catch block and wrote a function that searches the dictionary and returns a default value if the key does not exist. This modification produced a speedup of around 2000x, and now the program scales well with the number of threads, even on the VM.
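    A minimal sketch of the exception-free lookup described above, assuming the dictionary stores its entries as a key-sorted vector of (key, value) pairs; the function name and signature are illustrative, not the original author's code.

    ```cpp
    #include <algorithm>
    #include <utility>
    #include <vector>

    // Exception-free lookup over a key-sorted vector of (key, value) pairs:
    // returns a caller-supplied default instead of throwing when the key
    // is absent.
    template <typename K, typename V>
    V get_or_default(const std::vector<std::pair<K, V>> &entries,
                     const K &key, const V &default_value)
    {
        // Binary search; assumes entries is sorted by key.
        auto it = std::lower_bound(entries.begin(), entries.end(), key,
                                   [](const std::pair<K, V> &e, const K &k) {
                                       return e.first < k;
                                   });
        if (it == entries.end() || it->first != key)
            return default_value;  // no throw, no stack unwinding
        return it->second;
    }
    ```

    Inside predict_rating, each `try { s = ...[temp_item]; } catch (...) { s = 0.0; }` then becomes a single call like `s = get_or_default(entries, temp_item, 0.0)`. Besides the per-throw unwinding cost, in some C++ runtimes the unwinder takes a process-wide lock while handling a throw, which would also explain why the exception-heavy version got worse as more threads were added.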

    Thanks to all of you and if you have other suggestions don't hesitate!

    Pierpaolo