I have code like this on C#
private static Random random = new Random();
public static string RandomString(int length)
{
const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
return new string(Enumerable.Repeat(chars, length)
.Select(s => s[random.Next(s.Length)]).ToArray());
}
Task.Factory.StartNew(() => {
System.Threading.Tasks.Parallel.For(0L, 10000000, new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount * 10 }, n =>
{
Console.WirteLine(RandomString(12));
});
}
Add parallel method to it, it manage to run generate 10 millions random string in less than 8 seconds & use all CPU power
I tried to do it again in C++
string NonRepeatChar(int max_len)
{
std::string valid_chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
std::random_device rd;
std::mt19937 g(rd());
std::shuffle(valid_chars.begin(), valid_chars.end(), g);
std::string rand_str(valid_chars.begin(), valid_chars.begin() + max_len);
return rand_str;
}
Applied the code to some recommend C++ parallel method
void multiply()
{
for (size_t i = 0; i < 10; i++)
{
for (size_t j = 0; j < 10; j++)
{
for (int k = 0; k < 10; k++)
{
printf("%s\n",NonRepeatChar(10));
}
}
}
}
class Foo {
public:
Foo() : counter_(0) {}
std::pair<string, std::future<void>> a_function(std::future<void>& f) {
// Ensure that the background task from the previous iteration
// has completed
f.wait();
// Set the task for the next iteration
std::future<void> fut = std::async(std::launch::async, &Foo::background_task, this);
// Do some work
string value = NonRepeatChar(12);
// Return the result and the future for the next iteration
return std::make_pair(value.c_str(), std::move(fut));
}
void background_task() {
++counter_;
}
private:
std::atomic<int> counter_;
};
Record the time when run it
int main()
{
clock_t tStart = clock();
std::future<void> bleak = std::async(std::launch::deferred, []() {});
Foo foo;
for (size_t i = 0; i < 10000; ++i) {
// Call the function
std::pair<string, std::future<void>> result = foo.a_function(bleak);
bleak = std::move(result.second);
std::cout << result.first << "\n";
}
printf("Time taken: %.2fs\n", (double)(clock() - tStart) / CLOCKS_PER_SEC);
return 0;
}
Here're my result:
10.98s//normal loop
8.76s//multiply
8.88s//Foo
Obviously the code didn't have any difference compare to original loop, only generate 10000 line, & it didn't even use all CPU power like C#. Is there something wrong with the parallel method? How may I optimize it?
This is a simple single-threaded example of what can be done with a hybrid C/C++ approach. Game developers use methods which are hybrids of "formal" C++ code and look less like python. Of course, like Marmite, you love it or hate it, but regardless, the results speak for themselves.
I apologise if this is more to learn that you thought.
This particular example generates 10M strings in 3.682s on a single thread on my old AMD box. You can launch a small number of async workers (< std::thread::hardware_concurrency()) to carve the work up into chunks of around a million cycles. You would then have synchronisation issues with your I/O, so be careful and avoid mutexes!
To go much faster, you need to unroll loops manually, use SIMD arithmetic and so on. For example, this case would work well with SIMD permute vectors.
#include <stdint.h>
#include <stdio.h>
// This is typical of the random number generators used in professional games.
// It is less "correct" than mersenne twisters, for example, but much faster.
inline uint32_t fast_rand(int32_t &seed, uint32_t limit) {
// Prevent infinite loops.
//if (limit == 0) return 0;
// Make a mask that has all 1s in the bottom few bits.
// This reduces the number of iterations of the loop to ~1
int leading_zeros = __builtin_clz(limit);
int mask = 0xffffffff >> leading_zeros;
// Loop until our result is in range using rotate and xor.
do {
seed = (seed << 1) ^ ((seed >> 31) & 0xa53a9be9);
} while ((seed & mask) >= limit);
return seed & mask;
}
int main() {
// I'm using two seeds to prevent coupling.
// On their own, their quantiles are pretty flat, but
// in this example they couple, causing conditioning in the results.
int32_t length_seed = (int32_t)0x95abcfad;
int32_t swap_seed = (int32_t)0xba7235fab;
for (int i = 0; i != 10000000; ++i) {
// Note we don't use a std::string. These are very slow.
char chars[] =
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
auto num_chars = sizeof(chars) - 1;
auto length = fast_rand(length_seed, num_chars-1) + 1;
// Trim the string to the right length.
chars[length] = 0;
// Shuffle the characters.
for (int j = 0; j != length; ++j) {
int swapper = j + fast_rand(swap_seed, length - j);
auto tmp = chars[j];
chars[j] = chars[swapper];
chars[swapper] = tmp;
}
// Print with puts (not iostreams).
puts(chars);
}
}
For "hot loop" examples like this you should check your result on godbolt or similar.
clang with -O3 -mlzcnt gives the following inner loop.
.LBB0_4: # Parent Loop BB0_1 Depth=1
mov rsi, rax
sub rsi, rdx
lzcnt ecx, esi
mov edi, -1
shr edi, cl
.LBB0_5: # Parent Loop BB0_1 Depth=1
lea ecx, [rbx + rbx]
sar ebx, 31
and ebx, -1522885655
xor ebx, ecx
mov ecx, ebx
and ecx, edi
cmp rsi, rcx
jbe .LBB0_5
add ecx, edx
mov sil, byte ptr [rsp + rdx]
movsxd rdi, ecx
mov cl, byte ptr [rsp + rdi]
mov byte ptr [rsp + rdx], cl
mov byte ptr [rsp + rdi], sil
add rdx, 1
cmp rdx, rax
jne .LBB0_4