Search code examples
c++algorithmmergec++17gnu

How to efficiently merge k sorted pairwise key/value vectors by keys?


I want to merge k sorted pairwise key/value vectors by keys. Typically, the size n of the vectors is very large (e.g., n >= 4,000,000,000).

Consider the following example for k = 2:

// Input
keys_1 = [1, 2, 3, 4], values_1 = [11, 12, 13, 14]
keys_2 = [3, 4, 5, 6], values_2 = [23, 24, 25, 26]

// Output
merged_keys = [1, 2, 3, 3, 4, 4, 5, 6], merged_values = [11, 12, 13, 23, 14, 24, 25, 26]

Since __gnu_parallel::multiway_merge is a highly efficient k-way merge algorithm, I tried to utilize a state-of-the-art zip iterator (https://github.com/dpellegr/ZipIterator) to "combine" the key-value pair vectors.

#include <iostream>
#include <vector>
#include <parallel/algorithm>

#include "ZipIterator.hpp"

int main(int argc, char* argv[]) {
  std::vector<int> keys_1   = {1, 2, 3, 4};
  std::vector<int> values_1 = {11, 12, 13, 14};
  std::vector<int> keys_2   = {3, 4, 5, 6};
  std::vector<int> values_2 = {23, 24, 25, 26};

  std::vector<int> merged_keys(8);
  std::vector<int> merged_values(8);

  auto kv_it_1 = Zip(keys_1, values_1);
  auto kv_it_2 = Zip(keys_2, values_2);
  auto mkv_it = Zip(merged_keys, merged_values);

  auto it_pairs = {std::make_pair(kv_it_1.begin(), kv_it_1.end()),
                   std::make_pair(kv_it_2.begin(), kv_it_2.end())};

  __gnu_parallel::multiway_merge(it_pairs.begin(), it_pairs.end(), mkv_it.begin(), 8, std::less<>());
  
  for (size_t i = 0; i < 8; ++i) {
    std::cout << merged_keys[i] << ":" << merged_values[i] << (i == 7 ? "\n" : ", ");
  }

  return 0;
}

However, I get various compilation errors (building with -O3):

error: cannot bind non-const lvalue reference of type' std::__iterator_traits<ZipIter<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator > > >, void>::value_type&' {aka 'std::tuple<int, int>&'} to an rvalue of type' std::tuple<int, int>'

error: cannot convert ‘ZipIter<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator > > >::reference*’ {aka ‘ZipRef<int, int>’} to ‘_ValueType’ {aka ‘std::tuple<int, int>*’}

Is it possible to modify the ZipIterator to make it work?

Or is there a more efficient way of merging k sorted pairwise key/value vectors by keys?

Considered Alternatives

  1. Define a KeyValuePair struct with int key and int value members as well as operator< and operator<= operators. Move the elements of the key/value vectors into std::vector<KeyValuePair>s. Call __gnu_parallel::multiway_merge on the std::vector<KeyValuePair>s. Move the merged elements back into the key/value vectors. [Verdict: slow execution, high memory overhead, even with -O3]
  2. Use std::merge(std::execution::par_unseq, kv_it_1.begin(), kv_it_1.end(), kv_it_2.begin(), kv_it_2.end(), mkv_it.begin()); instead of __gnu_parallel::multiway_merge. [Verdict: supports only two key/value vectors]

Solution

  • Is it possible to modify the ZipIterator to make it work?

    Yes, but it would require patching __gnu_parallel::multiway_merge. The source of error is this line:

          /** @brief Dereference operator.
          *  @return Referenced element. */
          typename std::iterator_traits<_RAIter>::value_type&
          operator*() const
          { return *_M_current; }
    

    This is a member function of _GuardedIterator - an auxiliary structure used in the multiway_merge implementation. It wraps _RAIter class which in your case is ZipIter. By definition, when an iterator is dereferenced (*_M_current), the type of the returned expression is supposed to be reference type. However, this code expects it to be value_type&. In most cases, these are the same types. Indeed, when you dereference an item you expect to get a reference to this very item. However, it is impossible to do with a zip iterator, because its elements are virtual, they are created on the fly. That's why reference type of ZipIter is not a reference type at all, it is actually a value type called ZipRef:

      using reference = ZipRef<std::remove_reference_t<typename std::iterator_traits<IT>::reference>...>;
    

    Kind of the same practice that is used with (much hated) vector<bool>.

    So, there is no problem with ZipIterator, or with how you use the algorithm, it is a non-trivial requirement for the algorithm itself. The next question is, can we get rid of it?

    And the answer is yes. You can change _GuardedIterator::operator*() to return reference instead of value_type&. Then you will have a compile error in this line:

          // Default value for potentially non-default-constructible types.
          _ValueType* __arbitrary_element = 0;
    
          for (_SeqNumber __t = 0; __t < __k; ++__t)
            {
              if(!__arbitrary_element
                 && _GLIBCXX_PARALLEL_LENGTH(__seqs_begin[__t]) > 0)
                __arbitrary_element = &(*__seqs_begin[__t].first);
            }
    

    Here the address of an element is taken for some __arbitrary_element. We can store a copy of this element instead since we know ZipRef is cheap to copy and it is default-constructible:

          // Local copy of the element
          _ValueType __arbitrary_element_val;
          _ValueType* __arbitrary_element = 0;
    
          for (_SeqNumber __t = 0; __t < __k; ++__t)
            {
              if(!__arbitrary_element
                 && _GLIBCXX_PARALLEL_LENGTH(__seqs_begin[__t]) > 0) {
                __arbitrary_element_val = *__seqs_begin[__t].first;
                __arbitrary_element = &__arbitrary_element_val;
              }
            }
    

    The same errors will appear in several places in the file multiseq_selection.h, e.g. here and here. Fix all of them using the similar technique.

    Then you will see multiple errors like this one:

    ./parallel/multiway_merge.h:879:29: error: passing ‘const ZipIter<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > > >’ as ‘this’ argument discards qualifiers [-fpermissive]
    

    They are about const incorrectness. They are due to the fact that you declared it_pairs as auto, which in this particular scenario deduced the type to be std::inializer_list. This is a very peculiar type. For instance, it provides only constant access to its members, even though it itself is not declared const. That's the source of these errors. Change auto to e.g. std::vector and these errors are gone.

    It should compile find at this point. Just don't forget to compile with -fopenmp or you will get "undefined reference to `omp_get_thread_num'" error.

    Here is the output that I see:

    $ ./a.out
    1:11, 2:12, 3:13, 3:23, 4:14, 4:24, 5:25, 6:26