c++, multithreading, tbb, intel-tsx

Why are 8 threads slower than 2 threads?


First, I have to apologize for my poor English. I'm learning hardware transactional memory, and I'm using spin_rw_mutex.h from TBB to implement transaction blocks in C++. speculative_spin_rw_mutex, a class declared in spin_rw_mutex.h, is a mutex that already implements the RTM interface of Intel TSX.
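
For reference, a transactional critical section with this mutex looks roughly like the sketch below. The shared counter and the reader/writer split are only illustrative (they are not part of my program), assuming a TBB version that ships speculative_spin_rw_mutex:

#include <tbb/spin_rw_mutex.h>   // declares tbb::speculative_spin_rw_mutex

tbb::speculative_spin_rw_mutex counter_mutex;   // illustrative shared mutex
long shared_counter = 0;                        // illustrative shared data

void increment()
{
    // Writer side: the scoped_lock first tries to run the section as an
    // RTM transaction and falls back to really acquiring the lock if the
    // transaction keeps aborting.
    tbb::speculative_spin_rw_mutex::scoped_lock lock(counter_mutex);
    ++shared_counter;
}

long read_counter()
{
    // Reader side: passing false requests a shared (read) lock.
    tbb::speculative_spin_rw_mutex::scoped_lock lock(counter_mutex, /*write=*/false);
    return shared_counter;
}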

The example I use to test RTM is very simple. I created an Account class, and I transfer money from one account to another at random. All accounts are stored in an array of size 100. The random functions come from Boost (I think the STL has equivalent ones). The transfer function is protected by the speculative_spin_rw_mutex. I use tbb::parallel_for and tbb::task_scheduler_init to control concurrency, and all transfer calls are made inside the lambda passed to parallel_for. The total number of transfers is 1 million. The strange thing is that the program is fastest (8 seconds) when task_scheduler_init is set to 2, even though my CPU is an i7-6700K, which has 8 hardware threads. In the range from 8 to 50,000 the performance hardly changes (11 to 12 seconds), and when I increase task_scheduler_init to 100,000, the run time grows to about 18 seconds. I used a profiler to analyze the program and found that the hotspot is the mutex, yet I don't think the transaction roll-back rate is that high. I don't know why the program is so slow.

Somebody said that false sharing slows down the performance, so I tried to use

std::vector<Account, cache_aligned_allocator<Account>> cache_aligned_accounts(AccountsSIZE, Account(1000));

to replace the original array

Account* accounts[AccountsSIZE];

to avoid false sharing, but it seems nothing changed. Here is my new code.

#include <tbb/spin_rw_mutex.h>
#include <tbb/task_scheduler_init.h>
#include <tbb/task.h>
#include <tbb/parallel_for.h>
#include <tbb/spin_mutex.h>
#include <tbb/cache_aligned_allocator.h>
#include <boost/random.hpp>
#include <iostream>
#include <stdexcept>
#include <ctime>
#include <vector>

using namespace tbb;

tbb::speculative_spin_rw_mutex mu;

class Account {
private:
    int balance;
public:
    Account(int ba) { balance = ba; }
    int getBalance() { return balance; }
    void setBalance(int ba) { balance = ba; }
};

// Transfer function. Uses the speculative mutex to protect the critical section.
void transfer(Account &from, Account &to, int amount)
{
    speculative_spin_rw_mutex::scoped_lock lock(mu);
    if (from.getBalance() < amount) {
        throw std::invalid_argument("Illegal amount!");
    } else {
        from.setBalance(from.getBalance() - amount);
        to.setBalance(to.getBalance() + amount);
    }
}

const int AccountsSIZE = 100;

// Random number generator and distributions
boost::random::mt19937 gener(time(0));
boost::random::uniform_int_distribution<> distIndex(0, AccountsSIZE - 1);
boost::random::uniform_int_distribution<> distAmount(1, 1000);

/* Run all the transfers */
void all_transfer_task()
{
    task_scheduler_init init(10000); // Set how many tasks can run together

    /* Initialize accounts, using cache_aligned_allocator to avoid false sharing */
    std::vector<Account, cache_aligned_allocator<Account>> cache_aligned_accounts(AccountsSIZE, Account(1000));
    const int TransferTIMES = 10000000;

    // All transfer tasks
    parallel_for(0, TransferTIMES, 1, [&](int i) {
        try {
            transfer(cache_aligned_accounts[distIndex(gener)],
                     cache_aligned_accounts[distIndex(gener)],
                     distAmount(gener));
        }
        catch (const std::exception& e) {
            //std::cerr << e.what() << std::endl;
        }
        //std::cout << distIndex(gener) << std::endl;
    });

    std::cout << cache_aligned_accounts[0].getBalance() << std::endl;
    int total_balance = 0;
    for (size_t i = 0; i < AccountsSIZE; i++) {
        total_balance += cache_aligned_accounts[i].getBalance();
    }
    std::cout << total_balance << std::endl;
}
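
As an aside, another way to rule out false sharing (just a sketch, not something I have benchmarked) is to over-align the account type itself so each element occupies a whole cache line; PaddedAccount and the 64-byte line size below are assumptions for illustration:

#include <vector>
#include <tbb/cache_aligned_allocator.h>

// alignas(64) forces each element onto its own 64-byte cache line, so two
// neighbouring accounts can never share a line even in a contiguous vector.
struct alignas(64) PaddedAccount {
    int balance;
    explicit PaddedAccount(int ba = 0) : balance(ba) {}
};

static_assert(sizeof(PaddedAccount) % 64 == 0,
              "each PaddedAccount should occupy whole cache lines");

// cache_aligned_allocator aligns the start of the buffer; the padded type
// keeps the elements one cache line apart after that.
std::vector<PaddedAccount, tbb::cache_aligned_allocator<PaddedAccount>>
    padded_accounts(100, PaddedAccount(1000));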


Solution

  • While I can't reproduce your benchmark, I see two possible causes for this behavior:

    • "Too many cooks boil the soup": you use a single spin_rw_mutex that is locked by all the transfers by all the threads. Seems to me that your transfers execute sequentially. This would explain why the profile sees a hot point there. The Intel page warns against performance degradation in such case.

    • Throughput vs. speed: on an i7, in a couple of benchmarks, I have noticed that when you use more cores, each core runs a little slower, so the overall time of a fixed-size loop gets longer. However, counting the overall throughput (i.e., the total number of transactions completed across all the parallel loops), the throughput is much higher (although not fully proportional to the number of cores).

    I'd rather opt for the first cause, but the second shouldn't be ruled out.
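
Not from the original answer, but to illustrate the first point: below is a minimal sketch of fine-grained locking in which every account carries its own tbb::spin_mutex and a transfer locks only the two accounts involved. LockedAccount and its field names are invented for the example.

#include <tbb/spin_mutex.h>
#include <stdexcept>

// One lock per account instead of one global mutex, so unrelated transfers
// no longer serialize on the same lock.
struct LockedAccount {
    tbb::spin_mutex lock;
    int balance;
    explicit LockedAccount(int ba = 0) : balance(ba) {}
};

void transfer(LockedAccount &from, LockedAccount &to, int amount)
{
    if (&from == &to)
        return;  // the benchmark can pick the same account twice; treat that as a no-op

    // Acquire the two locks in address order so that two concurrent transfers
    // between the same pair of accounts cannot deadlock.
    LockedAccount *first  = (&from < &to) ? &from : &to;
    LockedAccount *second = (&from < &to) ? &to   : &from;

    tbb::spin_mutex::scoped_lock l1(first->lock);
    tbb::spin_mutex::scoped_lock l2(second->lock);

    if (from.balance < amount)
        throw std::invalid_argument("Illegal amount!");
    from.balance -= amount;
    to.balance   += amount;
}

Whether this beats the single speculative lock depends on the conflict rate: with only 100 accounts, two random transfers still collide fairly often, but they no longer all serialize on one lock.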