Tags: c++, c++17, std-ranges

Why is an implementation using `std::views` and PSTL slower?


I learned about PSTL recently and I am experimenting with it. I don't understand why the following two programs behave so differently. Here they are:


// a.cpp
#include <algorithm>
#include <cstdio>
#include <cstring>
#include <iostream>
#include <ranges>
#include <vector>

const long n = 1000;
long dp[n][n];

int main() {
  // init
  for (long i = 0; i < n; i++) {
    dp[i][0] = dp[0][i] = 1;
  }
  // dp, dp[i][j] = max((dp[i - 1][k] + dp[i - 1][n - 1 - k]) % 1000, k < j)
  for (long i = 1; i < n; i++) {
    for (long j = 1; j < n; j++) {
      dp[i][j] = -1;
      for (long k = 0; k < j; k++) {
        dp[i][j] = std::max(dp[i][j], (dp[i - 1][k] + dp[i - 1][n - 1 - k]) % 1000);
      }
    }
  }
  printf("res = %ld\n", dp[n - 1][n - 1]);
}
// b.cpp
#include <algorithm>
#include <cstdio>
#include <cstring>
#include <iostream>
#include <ranges>
#include <vector>
#include <execution>

const long n = 1000;
long dp[n][n];

int main() {
  // init
  for (long i = 0; i < n; i++) {
    dp[i][0] = dp[0][i] = 1;
  }
  // dp, dp[i][j] = max((dp[i - 1][k] + dp[i - 1][n - 1 - k]) % 1000, k < j)
  for (long i = 1; i < n; i++) {
    for (long j = 1; j < n; j++) {
      auto view = std::views::iota(0, j)
        | std::views::transform([&i](long k) { 
            return (dp[i - 1][k] + dp[i - 1][n - 1 - k]) % 1000;
          })
        | std::views::common;
      auto it = std::max_element(std::execution::par, view.begin(), view.end());
      dp[i][j] = *it;
    }
  }
  printf("res = %ld\n", dp[n - 1][n - 1]);
}

And then I compile and run both of them:

➜  ~ g++ -std=c++2a -Ofast -march=native a.cpp -o a
➜  ~ g++ -std=c++2a -Ofast -march=native b.cpp -o b
➜  ~ time ./a
res = 998
./a  0.79s user 0.01s system 99% cpu 0.804 total
➜  ~ time ./b
res = 998
./b  3.22s user 0.00s system 99% cpu 3.221 total

Even more surprisingly, -O2 is faster than -O3:

➜  ~ g++ -Wall -std=c++2a -O2 -march=native b.cpp -o b -ltbb && time ./b
res = 998
./b  1.13s user 0.00s system 99% cpu 1.134 total
➜  ~ g++ -Wall -std=c++2a -O3 -march=native b.cpp -o b -ltbb && time ./b
res = 998
./b  3.18s user 0.00s system 99% cpu 3.180 total

Compiler version (I know it's old):

➜  ~ g++ --version                                 
g++ (Debian 10.2.1-6) 10.2.1 20210110
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I definitely did not expect such a drastic timing difference, let alone in the "opposite" direction, since b.cpp uses the parallel policy std::execution::par. I also looked at the assembly code, and it seems that b.cpp doesn't even call std::max_element anymore (see here), so I guess it was optimised out. Then how is it still slower?

Can someone provide insight into what I am doing wrong, and how I should "fix" it in a way that still uses the parallelised STL?

Thanks!


Solution

  • I got a satisfying answer that might help people who find this question in the future.

    Firstly, I did not use -ltbb; it is required to enable parallelisation, and without it the program still builds but runs serially.

    Also, the timing shows "99% cpu", which indicates that nothing is parallelised. Indeed, looking at the Godbolt disassembly, we see that all the calls are inlined, which matches my observation ("I also looked at the assembly code, and it seems that b.cpp doesn't even call std::max_element anymore"). But being inlined doesn't mean it's better optimised than a.cpp.

    Finally, a better solution here is to parallelise the j loop instead. Note that for a fixed i, the iterations of the j loop can run in parallel without any write data race (of course they can all read from dp[i - 1][*] simultaneously).

    Instead of comparing and writing through dp[i][j] directly in the std::max call, which requires reading and writing that array element on every iteration, I use a local variable that the compiler can keep in a register.

    I also switched back to omp.h after the discussion in the comments suggesting the STL might be slow. To be fair, that matches my usual experience too; even std::deque is slower than a quick-and-dirty one I wrote myself.

    #include <algorithm>
    #include <cstdio>
    #include <cstring>
    #include <iostream>
    #include <vector>
    #include <omp.h>
    
    // compile with -fopenmp
    
    const long r = 50;
    const long n = 5000;
    long dp[r][n];
    
    int main() {
      // init
      for (long i = 0; i < n; i++) {
        dp[0][i] = 1;
      }
      // dp, dp[i][j] = max((dp[i - 1][k] + dp[i - 1][n - 1 - k]) % 1000, k < j)
      for (long i = 1; i < r; i++) {
        dp[i][0] = 1;
        #pragma omp parallel for
        for (long j = 1; j < n; j++) {
          long max = -1;
          for (long k = 0; k < j; k++) {
            max = std::max(max, (dp[i - 1][k] + dp[i - 1][n - 1 - k]) % 1000);
          }
          dp[i][j] = max;
        }
      }
      printf("res = %ld\n", dp[r - 1][n - 1]);
    }
    

    Timing:

    ➜  ~ g++ -Wall -std=c++2a -O2 -march=native b.cpp -o b -fopenmp && time ./b
    res = 998
    ./b  2.56s user 0.01s system 910% cpu 0.283 total
    

    As you can see, about 910% cpu is used. For n = 20000 it only takes 3 seconds.