c++performance parallel-processing parallel.foreach

Parallel on C++ isn't fast like C#

I have code like this on C#

private static Random random = new Random();
public static string RandomString(int length)
{
    const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    return new string(Enumerable.Repeat(chars, length)
      .Select(s => s[random.Next(s.Length)]).ToArray());
}
     Task.Factory.StartNew(() => {       
                 System.Threading.Tasks.Parallel.For(0L, 10000000, new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount * 10 }, n =>
                 {
                 Console.WirteLine(RandomString(12));
                 });
                }

Add parallel method to it, it manage to run generate 10 millions random string in less than 8 seconds & use all CPU power

I tried to do it again in C++

string NonRepeatChar(int max_len)
{
    std::string valid_chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

    std::random_device rd;
    std::mt19937 g(rd());
    std::shuffle(valid_chars.begin(), valid_chars.end(), g);
    std::string rand_str(valid_chars.begin(), valid_chars.begin() + max_len);
    return rand_str;
}

Applied the code to some recommend C++ parallel method

void multiply()
{
for (size_t i = 0; i < 10; i++)
{
    for (size_t j = 0; j < 10; j++)
    {           
        for (int k = 0; k < 10; k++)
        {
            printf("%s\n",NonRepeatChar(10));
        }           
    }
}
}

class Foo {
public:
    Foo() : counter_(0) {}

    std::pair<string, std::future<void>> a_function(std::future<void>& f) {
        // Ensure that the background task from the previous iteration
        // has completed
        f.wait();

        // Set the task for the next iteration
        std::future<void> fut = std::async(std::launch::async, &Foo::background_task, this);

        // Do some work
        string value = NonRepeatChar(12);

        // Return the result and the future for the next iteration
        return std::make_pair(value.c_str(), std::move(fut));
    }

    void background_task() {
        ++counter_;
    }

private:
    std::atomic<int> counter_;
};

Record the time when run it

int main()
{   
    clock_t tStart = clock();
    std::future<void> bleak = std::async(std::launch::deferred, []() {});

    Foo foo;

    for (size_t i = 0; i < 10000; ++i) {
        // Call the function
        std::pair<string, std::future<void>> result = foo.a_function(bleak);    
        bleak = std::move(result.second);   
        std::cout << result.first << "\n";
    }
    printf("Time taken: %.2fs\n", (double)(clock() - tStart) / CLOCKS_PER_SEC);
    return 0;
    }

Here're my result:

10.98s//normal loop

8.76s//multiply

8.88s//Foo

Obviously the code didn't have any difference compare to original loop, only generate 10000 line, & it didn't even use all CPU power like C#. Is there something wrong with the parallel method? How may I optimize it?

Solution

This is a simple single-threaded example of what can be done with a hybrid C/C++ approach. Game developers use methods which are hybrids of "formal" C++ code and look less like python. Of course, like Marmite, you love it or hate it, but regardless, the results speak for themselves.

I apologise if this is more to learn that you thought.

This particular example generates 10M strings in 3.682s on a single thread on my old AMD box. You can launch a small number of async workers (< std::thread::hardware_concurrency()) to carve the work up into chunks of around a million cycles. You would then have synchronisation issues with your I/O, so be careful and avoid mutexes!

To go much faster, you need to unroll loops manually, use SIMD arithmetic and so on. For example, this case would work well with SIMD permute vectors.

#include <stdint.h>
#include <stdio.h>

// This is typical of the random number generators used in professional games.
// It is less "correct" than mersenne twisters, for example, but much faster.
inline uint32_t fast_rand(int32_t &seed, uint32_t limit) {
  // Prevent infinite loops.
  //if (limit == 0) return 0;

  // Make a mask that has all 1s in the bottom few bits.
  // This reduces the number of iterations of the loop to ~1
  int leading_zeros = __builtin_clz(limit);
  int mask = 0xffffffff >> leading_zeros;

  // Loop until our result is in range using rotate and xor.
  do {
    seed = (seed << 1) ^ ((seed >> 31) & 0xa53a9be9);
  } while ((seed & mask) >= limit);

  return seed & mask;
}

int main() {
  // I'm using two seeds to prevent coupling.
  // On their own, their quantiles are pretty flat, but
  // in this example they couple, causing conditioning in the results.
  int32_t length_seed = (int32_t)0x95abcfad;
  int32_t swap_seed = (int32_t)0xba7235fab;

  for (int i = 0; i != 10000000; ++i) {
    // Note we don't use a std::string. These are very slow.
    char chars[] = 
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
    auto num_chars = sizeof(chars) - 1;
    auto length = fast_rand(length_seed, num_chars-1) + 1;

    // Trim the string to the right length.
    chars[length] = 0;

    // Shuffle the characters.
    for (int j = 0; j != length; ++j) {
      int swapper = j + fast_rand(swap_seed, length - j);
      auto tmp = chars[j];
      chars[j] = chars[swapper];
      chars[swapper] = tmp;
    }

    // Print with puts (not iostreams).
    puts(chars);
  }
}

For "hot loop" examples like this you should check your result on godbolt or similar.

clang with -O3 -mlzcnt gives the following inner loop.

.LBB0_4:                                #   Parent Loop BB0_1 Depth=1
    mov     rsi, rax
    sub     rsi, rdx
    lzcnt   ecx, esi
    mov     edi, -1
    shr     edi, cl
.LBB0_5:                                #   Parent Loop BB0_1 Depth=1
    lea     ecx, [rbx + rbx]
    sar     ebx, 31
    and     ebx, -1522885655
    xor     ebx, ecx
    mov     ecx, ebx
    and     ecx, edi
    cmp     rsi, rcx
    jbe     .LBB0_5
    add     ecx, edx
    mov     sil, byte ptr [rsp + rdx]
    movsxd  rdi, ecx
    mov     cl, byte ptr [rsp + rdi]
    mov     byte ptr [rsp + rdx], cl
    mov     byte ptr [rsp + rdi], sil
    add     rdx, 1
    cmp     rdx, rax
    jne     .LBB0_4