Search code examples
c++multithreadinglibc++ostringstream

Why is locale causing std::ostringstream to get slower as I use more threads?


I'm building some formatted strings using std::ostringstream. When running on a single thread, code profiling shows no bottle neck caused by std::ostringstream.

Code Profiling with a single thread

When I start using more threads, std::ostringstream slows down due to std::__1::locale::locale.

Code Profiling with two threads

This gets worse and worse as more threads are used.

Code Profiling with three threads

I'm not performing any thread synchronization explicitly but I suspect something inside std::__1::locale::locale is causing my threads to block which gets worse as I use more threads. It's the difference between a single thread taking ~30 seconds and 10 threads taking 10 minutes.

The code is in question is small but called many times,

static std::string to_string(const T d) {
    std::ostringstream stream;
    stream << d;

    return stream.str();
}

When I change it to avoid constructing a new std::ostringstream every time,

thread_local static std::ostringstream stream;
const std::string clear;

static std::string to_string(const T d) {
    stream.str(clear);
    stream << d;

    return stream.str();
}

I recover multithreaded performance but single thread performance suffers. What can I do to avoid this problem? The strings built here never need to be human readable. They are only used so that I can work around the lack of a hash function for std::complex. Is there away to avoid localization when building formatted strings?

#include <map>
#include <sstream>
#include <complex>
#include <iostream>
#include <thread>
#include <chrono>

thread_local std::map<std::string, void *> cache;

int main(int argc, const char * argv[]) {
    for (size_t i = 1; i <= 10; i++) {
        const std::chrono::high_resolution_clock::time_point start = std::chrono::high_resolution_clock::now();
        std::vector<std::thread> threads(i);
        for (auto &t : threads) {
            t = std::thread([] () -> void {
                for (size_t j = 0; j < 1000000; j++) {
                    std::ostringstream stream;
                    stream << std::complex<double> (static_cast<double> (j));
                    cache[stream.str()] = reinterpret_cast<void *> (&j);
                }
            });
        }
        for (auto &t : threads) {
            t.join();
        }
        
        const std::chrono::high_resolution_clock::time_point end =
                  std::chrono::high_resolution_clock::now();
        const auto total_time = end - start;
        const std::chrono::nanoseconds total_time_ns =
                  std::chrono::duration_cast<std::chrono::nanoseconds> (total_time);

        if (total_time_ns.count() < 1000) {
            std::cout << total_time_ns.count()               << " ns"  << std::endl;
        } else if (total_time_ns.count() < 1000000) {
            std::cout << total_time_ns.count()/1000.0        << " μs"  << std::endl;
        } else if (total_time_ns.count() < 1000000000) {
            std::cout << total_time_ns.count()/1000000.0     << " ms"  << std::endl;
        } else if (total_time_ns.count() < 60000000000) {
            std::cout << total_time_ns.count()/1000000000.0  << " s"   << std::endl;
        } else if (total_time_ns.count() < 3600000000000) {
            std::cout << total_time_ns.count()/60000000000.0 << " min" << std::endl;
        } else {
            std::cout << total_time_ns.count()/3600000000000 << " h"   << std::endl;
        }
        std::cout << std::endl;
    }

    return 0;
}

Running on an 10 core (8 performance, 2 efficiency)Apple M1 produces the output. Build setting are using the standard Xcode defaults. For a debug build the timings are

3.90096 s

4.15853 s

4.48616 s

4.843 s

6.15202 s

8.14986 s

10.6319 s

12.7732 s

16.7492 s

19.9288 s

For a Release build, the timings are

844.28 ms

1.23803 s

2.05088 s

3.39994 s

7.43743 s

9.53968 s

11.2953 s

12.6878 s

20.3917 s

24.1944 s

Solution

  • Doing some digging for alternatives, the std::to_string notes

    std::to_string relies on the current locale for formatting purposes, and therefore concurrent calls to std::to_string from multiple threads may result in partial serialization of calls. C++17 provides std::to_chars as a higher-performance locale-independent alternative.

    Using std::to_chars in the minimum example instead results in much better performance to what I was expecting for an embarrassingly parallel code.

    #include <map>
    #include <sstream>
    #include <complex>
    #include <iostream>
    #include <thread>
    #include <chrono>
    #include <charconv>
    #include <limits>
    #include <string>
    #include <iomanip>
    
    thread_local std::map<std::string, void *> cache;
    thread_local std::map<std::string, void *> cache2;
    
    void stream() {
        for (size_t i = 1; i <= 10; i++) {
            const std::chrono::high_resolution_clock::time_point start = std::chrono::high_resolution_clock::now();
            std::vector<std::thread> threads(i);
            for (auto &t : threads) {
                t = std::thread([] () -> void {
                    for (size_t j = 0; j < 1000000; j++) {
                        std::ostringstream stream;
                        stream << std::setprecision(std::numeric_limits<double>::max_digits10);
                        stream << std::complex<double> (static_cast<double> (j));
                        cache[stream.str()] = reinterpret_cast<void *> (&j);
                    }
                });
            }
            for (auto &t : threads) {
                t.join();
            }
            
            const std::chrono::high_resolution_clock::time_point end =
                      std::chrono::high_resolution_clock::now();
            const auto total_time = end - start;
            const std::chrono::nanoseconds total_time_ns =
                      std::chrono::duration_cast<std::chrono::nanoseconds> (total_time);
    
            if (total_time_ns.count() < 1000) {
                std::cout << total_time_ns.count()               << " ns"  << std::endl;
            } else if (total_time_ns.count() < 1000000) {
                std::cout << total_time_ns.count()/1000.0        << " μs"  << std::endl;
            } else if (total_time_ns.count() < 1000000000) {
                std::cout << total_time_ns.count()/1000000.0     << " ms"  << std::endl;
            } else if (total_time_ns.count() < 60000000000) {
                std::cout << total_time_ns.count()/1000000000.0  << " s"   << std::endl;
            } else if (total_time_ns.count() < 3600000000000) {
                std::cout << total_time_ns.count()/60000000000.0 << " min" << std::endl;
            } else {
                std::cout << total_time_ns.count()/3600000000000 << " h"   << std::endl;
            }
            std::cout << std::endl;
        }
    }
    
    void to_chars() {
        for (size_t i = 1; i <= 10; i++) {
            const std::chrono::high_resolution_clock::time_point start = std::chrono::high_resolution_clock::now();
            std::vector<std::thread> threads(i);
            const size_t max_digits = std::numeric_limits<double>::max_digits10;
            for (size_t k = 0, ke = threads.size(); k < ke; k++) {
                threads[k] = std::thread([] () -> void {
                    std::array<char, 36> buffer;
                    for (size_t j = 0; j < 1000000; j++) {
                        char *end = std::to_chars(buffer.begin(), buffer.end(), static_cast<double> (j),
                                                  std::chars_format::general, max_digits).ptr;
                        cache2[std::string(buffer.data(), end)] = reinterpret_cast<void *> (&j);
                    }
                });
            }
            for (auto &t : threads) {
                t.join();
            }
            
            const std::chrono::high_resolution_clock::time_point end =
                      std::chrono::high_resolution_clock::now();
            const auto total_time = end - start;
            const std::chrono::nanoseconds total_time_ns =
                      std::chrono::duration_cast<std::chrono::nanoseconds> (total_time);
    
            if (total_time_ns.count() < 1000) {
                std::cout << total_time_ns.count()               << " ns"  << std::endl;
            } else if (total_time_ns.count() < 1000000) {
                std::cout << total_time_ns.count()/1000.0        << " μs"  << std::endl;
            } else if (total_time_ns.count() < 1000000000) {
                std::cout << total_time_ns.count()/1000000.0     << " ms"  << std::endl;
            } else if (total_time_ns.count() < 60000000000) {
                std::cout << total_time_ns.count()/1000000000.0  << " s"   << std::endl;
            } else if (total_time_ns.count() < 3600000000000) {
                std::cout << total_time_ns.count()/60000000000.0 << " min" << std::endl;
            } else {
                std::cout << total_time_ns.count()/3600000000000 << " h"   << std::endl;
            }
            std::cout << std::endl;
        }
    }
    
    int main(int argc, const char * argv[]) {
        stream();
        std::cout << "-----------------------------------------------------------" << std::endl;
        to_chars();
        return 0;
    }
    

    Results in timings of

    854.078 ms
    
    1.3472 s
    
    2.26556 s
    
    3.61298 s
    
    7.55469 s
    
    9.29697 s
    
    11.321 s
    
    12.6926 s
    
    19.607 s
    
    24.4866 s
    
    -----------------------------------------------------------
    403.037 ms
    
    416.532 ms
    
    432.433 ms
    
    437.869 ms
    
    450.775 ms
    
    458.693 ms
    
    473.683 ms
    
    498.53 ms
    
    528.434 ms
    
    560.239 ms
    

    Code profiling confirms the may string hashes are no longer the largest bottleneck.

    Code profiling using std::to_char and 10 threads